AD-A261  404 


NSWCDD/MP-92/304


PROCEEDINGS  OF  THE  1992  COMPLEX  SYSTEMS 
ENGINEERING  SYNTHESIS  AND  ASSESSMENT 
TECHNOLOGY  WORKSHOP 


CUONG  NGUYEN,  Coordinator 


UNDERWATER SYSTEMS DEPARTMENT

20-24  JULY  1992 


Approved for public release; distribution is unlimited.


NAVAL  SURFACE  WARFARE  CENTER 

Dahlgren, Virginia 22448-5000 • Silver Spring, Maryland 20903-5000


93-04839 



NSWCDD/MP-92/304 


PROCEEDINGS  OF  THE  1992  COMPLEX  SYSTEMS 
ENGINEERING  SYNTHESIS  AND  ASSESSMENT 
TECHNOLOGY  WORKSHOP 


CUONG M. NGUYEN, Coordinator
UNDERWATER  SYSTEMS  DEPARTMENT 


20-24  JULY  1992 


Approved  for  public  release;  distribution  is  unlimited 


NAVAL  SURFACE  WARFARE  CENTER 

Dahlgren, Virginia 22448-5000 • Silver Spring, Maryland 20903-5000


FOREWORD 


1992  COMPLEX  SYSTEMS  ENGINEERING  SYNTHESIS  AND 
ASSESSMENT TECHNOLOGY WORKSHOP (CSESAW '92)


Mission critical computer (MCC) systems in the Department of Defense, and in
particular, the Navy, are extremely large and complex, controlling a wide variety of
assets operating in many unforeseeable situations. These systems have hard real-time,
stringent fault tolerance and intensive security requirements. They are
typically implemented on a combination of parallel and distributed architectures. In
addition, these systems are generally embedded within a human organization
structure and/or have human operators in the loop.

The emphasis of CSESAW (pronounced seesaw) '92 is on exploring system-level
design synthesis and assessment capabilities for MCC systems. These capabilities will
facilitate the development of such systems from informal system requirements,
through the design phase prototyping, and into implementation and post
deployment. Component products produced by these capabilities are specifications
that subenvironments [e.g., hardware engineering environment (HWEE), software
engineering environment (SEE), and human computer interaction engineering
environment (HCIEE)] will receive. The focus of this workshop is the development
and  integration  of  these  multiple  technologies  and  the  exploration  of  the  creation 
of  a  system-level  engineering  discipline  with  support  technologies  to  provide 
potential  high  payoff  solutions  to  the  difficult  problems  encountered  by  designers, 
developers,  and  maintainers  of  hard  real-time  MCC  systems.  The  emphasis  is  on 
resolving  system-level  technology  issues  that  cut  across  component  boundaries,  such 
as  those  associated  with  system  behavior  requirements  of  real  time,  fault  tolerance, 
and  security. 

Formidable challenges await the technology developers. First, there is a need
to establish strong scientific and engineering foundations with their associated
technological  advances  in  methodologies,  processes,  techniques,  and  supporting 
mechanizations.  Since  it  is  systems  that  need  to  be  engineered,  a  system  perspective 
should  be  followed,  allowing  the  orchestration,  interaction,  and  integration  of  the 
engineering  of  system  components. 

Second, the technologies and capabilities need to be integrated with the rest
of the engineering process. Therefore, the capability to provide tight linkages to
detailed design evaluation, systems forward engineering, and systems reengineering
must be developed, ultimately providing a seamless overall engineering process. A
significant amount of effort has been put into component technologies, such as
hardware, microelectronics, memory, databases, software, man machine interface,
etc. Major strides have been made in these areas in the last few years; however, the
formal, systematic integration and engineering of these components into an overall
system has lagged far behind. For hard real-time MCC systems with high fault
tolerance and security requirements, the problem is especially acute. This is a direct
result of a lack of a system-level engineering methodology.

Last, understanding the needs of the application communities and the
customers is of utmost importance to the successful transition of the technologies.
Demonstration of the correct technological capabilities must be based on the correct
scale, context, and scope of the target applications. Correct scale implies that the
technology can be applied to the problem size the application calls for. The correct
context means the technology is specific (or general) enough for the application
domain. The correct scope dictates that the technology addresses the nonfunctional
attributes (real-time, fault tolerance, etc.) that the particular application requires.

It  is  in  the  spirit  of  working  to  meet  these  challenges  that  we  welcome  you  to 
this  workshop.  We  hope  to  provide  in  the  workshop  an  atmosphere  in  which  the 
participants,  including  technology  developers,  researchers,  users,  and  customers,  can 
meet,  interact,  and  exchange  ideas  on  relevant  issues.  In  the  near  future,  we  hope  to 
be  able  to  say  that  this  workshop  was  the  beginning  of  a  new  generation  in  systems 
design  synthesis. 

This workshop would not have been possible without the hard work of many
people, including the workshop, program, and advisory committees; authors;
presenters of the submitted papers; panel members; workshop attendants; panel
chairs; and breakout session chairs. A very warm "thank you" is extended to all. In
particular, we wish to acknowledge Michael Edwards, Ngocdung Hoang, Cuong
Nguyen, Michael Jenkins, Chuck Sadek, Kathy Lederer, Adrien Meskin, Dong Choi,
and Janet Higgins. Finally, a special thanks goes to Elizabeth E. Wald and COR Grace
Thompson of the Office of Naval Technology for tirelessly working for and
supporting the technology developments in this important area.

We  hope  you  have  a  productive  and  enjoyable  workshop! 


Steven  L.  Howell 
Assistant  Workshop  Chairman 


Philip Q. Hwang
Workshop  General  Chairman 
University of California, San Diego
Abstract

This paper describes a new kind of distributed system service, the Availability
Management service, responsible for ensuring that the mission critical
services  of  a  distributed  system  remain  continuously  available  to  users  despite 
node  removals  and  node  restarts  caused  by  failures,  maintenance  and  growth. 
The  presentation  stresses  the  main  ideas  behind  this  new  service,  and  outlines 
a  simple  design  that  depends  upon  the  existence  of  synchronous  membership 
and  atomic  broadcast  group  communication  services.  Extensions  of  this  initial 
design  to  deal  with  asynchronous  group  communication  services  are  also  briefly 
discussed. 
1 Introduction


With the ever increasing dependence on computing services, their availability in the
presence of failures, maintenance and horizontal growth becomes of paramount
importance. In present systems, responsibility for reconfiguring a system after failure
or removal of processors for maintenance rests mostly with the human system
operators. Humans tend to have fairly slow reaction times, and this can result in lengthy
intervals during which critical services will be unavailable. Also, humans are
notorious for making mistakes, especially when under stress, and the mistakes made while
attempting repair actions can lead to further failures, causing further unavailability.
For example, [Gray86] reports that 42% of the failures in the Tandem distributed
systems are caused by human mistakes made during maintenance, operation and
configuration.

To ensure automatic reconfiguration in the presence of failures and maximize the
availability of critical services, the Advanced Automation System [CDD90], built for
supporting US air traffic control in the 21st century, uses a new Availability
Management service that automatically reconfigures servers implementing critical services in
the presence of processor removals caused by failures and maintenance and processor
additions caused by restart, repair and horizontal growth.

The purpose of this paper is to explain in a pedagogical manner the main ideas
behind this new service and outline a simple way of implementing such a service. Our
presentation sacrifices the description of many of the details involved in a realistic
Availability Management service design for the purpose of making the concepts on
which the service and its design are based easily understandable. To this end, we
focus on designing our Availability Management service on top of an easy to
understand synchronous communication environment and we deal with only one kind of
service availability policy. We conclude by discussing how our initial specification and
design can be extended to deal with asynchronous systems subject to partitioning as
well as with other kinds of service availability policies.

We  begin  by  introducing  our  basic  notions,  system  structure  and  assumptions  and 
by  stating  the  requirements  that  a  Service  Availability  Management  service  should 
satisfy.  A  replicated  implementation  is  then  briefly  described.  A  discussion  of  possible 
extensions  concludes  the  paper. 
2 Basic Concepts


Every computing system is built to provide services to its users. A service provided to
a  user  is  a  system  behavior  as  perceived  by  that  user:  a  sequence  of  observable  outputs 
triggered  by  a  sequence  of  service  operation  invocations.  The  service  specification 
prescribes  future  service  outputs  solely  in  terms  of  the  operations  being  currently 
invoked  and  the  current  service  state,  where  the  current  service  state  is  the  result 
of  applying  the  operation  invocations  completed  so  far  to  the  initial  service  state. 
Services  that  share  the  same  set  of  invocable  operations  and  the  same  set  of  potential 
behaviors  belong  to  the  same  service  type.  In  general,  each  service  of  a  certain  type 
has  a  current  state  distinct  from  that  of  other  services  of  the  same  type,  depending  on 
the  history  of  operation  invocations  completed  so  far  for  that  service.  For  example, 
the  relational  ANSI  SQL  query  and  update  operations  together  with  the  semantics 
defined for these operations in an ANSI SQL manual define the ANSI SQL relational
database  service  type;  if  “employees”  and  “accounting”  are  two  database  services  of 
this  type,  their  states  at  a  certain  point  in  time  are  in  general  different,  depending 
on  the  history  of  updates  applied  since  their  creation. 

The operations defined for a service can only be carried out by a service
implementation consisting of one or more servers (or objects). A server encapsulates private state
data by a set of procedures (or methods) that provide the only way for changing and
accessing the server's state. A server is a unit of failure and growth: at any point in
time a service implementation has a membership consisting of an integer number of
servers. Because servers are defined to be units of failure and growth, a server cannot
span several host systems that can fail independently. Service operation invocations
result in server procedure executions which can cause the state of the servers
implementing the service to change. Since the state of a service is a function of the states
of the servers that implement it, such server state changes lead in turn to service state
changes.

Note  that  it  is  vital  to  avoid  confusing  the  notions  of  service  and  server  (or  object). 
In  the  object-oriented  programming  literature,  the  term  object  is  often  used  in  a 
confusing  way  to  designate  both  what  we  call  service  and  what  we  call  server.  The 
confusion  is  understandable  if  each  service  is  implemented  by  a  single  object,  so  that 
there  is  a  one  to  one  mapping  between  services  and  objects.  This  was  historically 
the  case  for  most  of  the  work  on  object-oriented  programming,  where  issues  related 
to  fault-tolerance  or  replication  were  not  a  primary  focus.  The  confusion  becomes 
awkward  when  a  service  is  implemented  by  several  redundant  servers  (or  objects) 
which  are  independent  units  of  failure  and  growth.  In  this  latter  case,  the  objects 
that  implement  the  service  can  fail  and  restart  without  the  service  users  observing 
any service failure or restart. Thus, when services must remain available despite the
failure of some of the servers (or objects) that implement them, it becomes imperative
to distinguish between services and servers.

Object oriented programming methodology requires that service users not know any
details about how the state of a service is represented in terms of server states or how
the service operations are implemented by the server procedures. Such implementation
details are hidden from users, who need only know the externally visible, abstract
service specification [Parnas72]. For example, a database service can be implemented
by a single database server, by a set of distributed servers that each manages a
fragment of the database state, or by a group of redundant distributed database servers
that each manages a replica of the entire database. If redundant servers are used to
implement a service, the user need not know what synchronization and replication
policies exist among servers. The synchronization policy prescribes how far apart the
local states of the servers can get, where the distance between the local states of two
servers consists of the difference in the number of updates to the initial state applied
so far by them. If the policy is loose synchronization, a primary maintains the current
service state while one or more backups maintain past service states. A bound on the
distance between loosely synchronized servers can be maintained by periodic
checkpoints of the state of the primary to backups. If the policy is close synchronization,
then the servers act as peers by interpreting all service requests in parallel and
maintaining their internal states close to each other. The replication policy for a service s
specifies how many servers should exist for s. For example, a replication policy of 2
specifies that 2 redundant servers should be used to implement s. The synchronization
and replication policies specified for a service constitute the availability policy
for that service.
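
The loose synchronization policy can be illustrated with a minimal Python sketch. It is not taken from the paper: the names Server, distance and run_loosely_synchronized, and the choice of checkpointing after a fixed number of updates, are assumptions made only for illustration. The sketch shows that checkpointing the primary's state every k updates keeps the distance between primary and backup at or below k.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical server whose state is the list of updates applied so far."""
    applied: list = field(default_factory=list)


def distance(a: Server, b: Server) -> int:
    """Difference in the number of updates applied so far by two servers."""
    return abs(len(a.applied) - len(b.applied))


def run_loosely_synchronized(updates, checkpoint_every):
    """Apply every update at the primary and checkpoint the primary's state
    to the backup after each `checkpoint_every` updates (an assumed policy).
    Returns (primary, backup, largest distance observed before a checkpoint)."""
    primary, backup = Server(), Server()
    worst = 0
    for count, update in enumerate(updates, start=1):
        primary.applied.append(update)
        # Measure the distance before any checkpoint is taken.
        worst = max(worst, distance(primary, backup))
        if count % checkpoint_every == 0:
            backup.applied = list(primary.applied)  # checkpoint: copy the state
    return primary, backup, worst
```

With ten updates and a checkpoint after every third one, the backup ends one update behind the primary and the distance never exceeds three.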


3  System  Model  and  Assumptions 


We  consider  a  distributed  system  consisting  of  nodes  linked  by  a  communication 
network.  Nodes  can  be  uniprocessors  or  multiprocessors,  but  what  is  essential  is  that 
they  are  units  of  failure  and  growth;  at  any  point  in  time  a  node  is  perceived  as  either 
correctly  running  (or  active)  or  failed  (not  active)  by  another  node.  Servers  run  in  the 
nodes  of  the  system.  If  a  node  possesses  all  the  physical  resources  needed  for  running  a 
server for a certain service, it is called a (potential) host for that service. For example,
a node with enough computing power and memory that can access the disk(s) storing
the “employees” database is a potential host for the “employees” database service.
The set of all servers that can run in the hosts defined for a service s forms the team
of servers that can be used to implement s. With present operating systems, a server
for a certain service can be started in a host node only by another process that runs
in the same node. We assume the nodes have amnesia-crash failure semantics: after
a crash, a node restarts in a predefined initial state independent of the inputs seen
before the crash [Cris91]. We do not assume any particular network topology: it can
be point-to-point or broadcast channel based.

To make the presentation of Availability Management as simple as possible, we will
assume a synchronous [Cris91] communication network. Roughly speaking, a
synchronous network enables any two active servers to communicate within a known,
bounded time, so that no communication partitions are possible (a more elaborate
definition and ways to implement such communication networks are given in [Cris91]).
The synchronicity assumption enables us to avoid discussing issues related to potential
divergence of states among redundant servers due to partitioning and the need to
restrict activity to majority groups in order to prevent such divergence. We discuss
later extensions to the case of asynchronous communication networks that do not
guarantee a bound on the time it takes for an active server to send a message to
another active server.

A synchronous network allows one to implement three group communication services
that are fundamental for implementing the replicated data management needed
for implementing automatic availability management: internal clock synchronization,
synchronous atomic broadcast and synchronous membership.

An  internal  clock  synchronization  service  ensures  that  the  clocks  of  active  nodes  are 
synchronized  within  some  known  constant  maximum  deviation  at  any  point  in  real 
time  and  that  such  synchronized  clocks  run  within  a  linear  envelope  of  real  time. 
Protocols  for  achieving  internal  clock  synchronization  in  synchronous  communication 
networks are described in [CAS86], [Cris89], [HSS84], [K87], [LM85], [LL84], [S87],
and [ST87].

A synchronous atomic broadcast service enables any member s of a team A to
broadcast at any (synchronized) clock time T a message m to the group of active A
members so that the following properties hold, for some time constant D. If s initiates the
broadcast of m at clock time T, then at T+D, m is either delivered to all A members
that are active or is not delivered to any active member (atomicity). All messages
delivered are delivered in the same order at each active A member (order). If the sender
s does not fail while broadcasting m, then all active A members deliver m at T+D
(termination). Protocols for implementing synchronous atomic broadcast services for
point-to-point and broadcast channel based networks are given in [BD85], [CASD85]
and [Cris90]. All of these protocols depend on the existence of an underlying internal
clock synchronization service.
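
The atomicity and order properties can be stated as a check over delivery logs. The Python sketch below is only an illustration under assumptions: the function name check_atomic_broadcast and the log representation (one list of delivered message identifiers per member that stayed active) are hypothetical, and the timing bound D is not modeled.

```python
def check_atomic_broadcast(logs):
    """Check delivery logs against the atomicity and order properties.

    `logs` maps each member that stayed active to the list of message
    identifiers it delivered, in delivery order. Returns a pair
    (atomic, ordered): `atomic` is True when every member delivered the
    same set of messages, `ordered` is True when all members delivered
    them in the same sequence.
    """
    sequences = list(logs.values())
    if not sequences:
        return True, True  # no active members: both properties hold vacuously
    reference = sequences[0]
    atomic = all(set(seq) == set(reference) for seq in sequences)
    ordered = all(seq == reference for seq in sequences)
    return atomic, ordered
```

A log in which one member misses a message fails atomicity; a log in which two members deliver the same messages in different sequences satisfies atomicity but fails order.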


A synchronous membership service enables active A members to agree in a bounded
time on what member failures and recoveries happen in the history of the team. To
achieve this, a membership service organizes active team members into a sequence of
dynamic groups that exist over time, so that the following properties hold, for some
time constants D and J. New groups are created only in response to failures and
recoveries (stability). All members of a group agree on the membership of the group
(agreement). After joining a common group, any two active A members a, b will join
the same sequence of groups for as long as they stay active, so that they see all failure
and recovery events that affect the team A in the same order (order). Any failure of
an active A member f is detected within D clock time units, that is, it leads within
at most D clock time units to the creation of a new group that includes all active A
members but excludes f (bounded failure detection delay). Any recovery of a member
j leads within at most J clock time units to the creation of a new group that includes j
in addition to all the other active A members (bounded join delay). Several protocols
for implementing a synchronous membership service in point-to-point and broadcast
based networks are described in [Cris91]. The protocols depend on the existence of
underlying internal clock synchronization and synchronous atomic broadcast services.

We assume that clients of a service s address service requests in a location transparent
manner. That is, to ask for some operation o of s to be executed, a client simply
passes s.o to the request/reply transport service available on the client's node without
needing to know the location of the servers that implement s. The transport service
routes the request to some subset of the servers for s and then routes the reply back.
It can be connection oriented or connection-less, such as a remote procedure call
service. To achieve location transparency, the request/reply transport service can make
use of a registry service that maintains a mapping from server locations to the service
they provide. When a server providing service s starts on node p, it registers with
the registry service the fact that it provides service s on p. When a client on node
q invokes s.o, the local request/reply transport server looks up s in the registry and
finds out that the s.o request has to be sent to node p. To ensure high availability of
the registry service on which the distributed request/reply transport service depends,
its implementation may consist of a team of replicated servers that make use of the
membership and atomic broadcast services described above to maintain the
consistency of the replicated registry mapping in the presence of failures, recoveries and
concurrently initiated updates. We assume that the request/reply transport service
masks failures of s servers to clients for as long as at least one server for s remains
active, so that if no timely reply to a service request s.o sent from node q to an s
server is obtained, the request is automatically re-sent to some other s server that has
registered with the registry service.
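
The location-transparent routing and failure masking described above can be sketched in a few lines of Python. This is a single-process illustration under assumptions, not the transport service of the paper: the class Registry and the function invoke are hypothetical names, a node missing from the set of active nodes stands in for "no timely reply", and the registry itself is not replicated.

```python
class Registry:
    """Hypothetical registry: maps a service name to the nodes on which
    servers for that service have registered."""

    def __init__(self):
        self._nodes = {}

    def register(self, service, node):
        self._nodes.setdefault(service, []).append(node)

    def lookup(self, service):
        return list(self._nodes.get(service, []))


def invoke(registry, active_nodes, service, operation):
    """Route service.operation to a registered server, re-sending to the
    next registered server when a node gives no timely reply (modeled here
    as the node not being in `active_nodes`)."""
    for node in registry.lookup(service):
        if node in active_nodes:
            return f"{service}.{operation} executed on {node}"
    raise RuntimeError(f"no active server for {service}")
```

The client names only the service and the operation; which node answers is decided by the registry contents and by which registered nodes are still active.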


4  Requirements 

The goal of the Availability Management service is to enforce automatically the
availability policies specified for the services offered by a distributed system to clients
(without violating any constraints implied by their synchronization policies). For
example, if the availability policy for a service s is (loose-synchronization, 2) and the
primary server fails, the Availability Management service is required to first promote
the s backup to primary and then start another s backup, instead of just cold-starting
another primary for s. This will minimize the time s is unavailable to clients, since it
is much faster to promote a backup to primary than to cold-start a primary. Because
the underlying request/reply transport service will automatically re-route client
requests to the new primary after it registers as primary for s with the registry service,
clients do not see a service failure; the behavior seen by them is indistinguishable
from that seen when no s primary failure occurs but a request to perform some s
operation is lost and re-transmitted.

To simplify the presentation of the notion of Availability Management service, we
will consider that this service must automatically enforce a single availability policy
for the entire set S of critical services, and that the only reason why a server for some
critical service s ∈ S can crash is the crash of its underlying node. From this
presentation it should not be difficult to imagine how to deal with the cases when service
implementations follow several different availability policies and when server crashes
occur even if the underlying nodes do not crash. For concreteness, we will consider that
the availability policy specified by the system administrator for all critical services is
(loose-synchronization, 2). The reason for our choice is that primary/backup server
groups are very popular commercially [Gray86] and that most of the critical services
in the AAS system that we helped design are implemented by primary/backup server
groups [CDD90].

If we denote the set of active system nodes by N, and the set of hosts for a service s ∈ S
by H(s), the Availability Management service is required to maintain a primary and a
backup for s on distinct nodes as long as there exist at least two nodes in N ∩ H(s) and
a primary for s as long as the number of nodes in N ∩ H(s) is one. If there is a backup
when a primary must be started, then the backup must be promoted to primary, to
minimize unavailability of s to clients (the backup does not provide service, only the
primary answers to client requests). Another constraint implied by the availability
policy of s is that at no time should there exist two primaries for s. We are not
concerned here with the local state check-pointing protocols followed by a primary
and a backup to maintain a bound on the distance between their local states or after
a new backup is started. Our view is that check-pointing is an application specific
issue that is orthogonal to the system wide service Availability Management issue.
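
The requirement on primaries and backups can be written as a check on one service's state once a transition has completed. The Python sketch below is illustrative only: the function name policy_violations and its argument names are assumptions, and None is used where no primary or backup exists. Because a single pointer per role is checked, the "never two primaries" constraint is enforced by the representation itself rather than by an explicit test.

```python
def policy_violations(active, hosts, primary, backup):
    """List the ways one service's state departs from the
    (loose-synchronization, 2) requirement.

    active  -- set of active nodes
    hosts   -- set of hosts defined for the service
    primary -- node running the primary, or None
    backup  -- node running the backup, or None
    """
    usable = active & hosts
    problems = []
    if len(usable) >= 1 and primary is None:
        problems.append("no primary although a host is available")
    if len(usable) >= 2 and backup is None:
        problems.append("no backup although two hosts are available")
    if primary is not None and primary == backup:
        problems.append("primary and backup share a node")
    for role, node in (("primary", primary), ("backup", backup)):
        if node is not None and node not in usable:
            problems.append(f"{role} runs on a node that is not an active host")
    return problems
```

An empty list means the state satisfies the requirement; each string names one violated clause.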

Following [Cris85], we view the Availability Management service as exporting two
kinds of operations to two concurrent “users”: the human administrator and the
Adverse Environment. The operations that the system administrator can invoke are
start-service(s), stop-service(s), add-host(n), remove-host(n) and start-node(n), while
the Adverse Environment can invoke the crash-node(n) operation. Often, nodes
reboot automatically after a crash, in which case the start-node operation is not really
performed by the system administrator, but by a third concurrent “user”: the time.
In other words, with automatic reboot, the passage of a certain number of time units
will trigger a start-node(n) invocation after a crash-node(n) invocation by the
Adverse Environment. While start-service, stop-service, add-host and remove-host are
down-calls from the Administrator's command interpreter service, the crash-node and
start-node command invocations are up-calls from the underlying node membership
service.

We specify our Availability Management service by first defining its abstract state
and then describing the state transitions that take place in response to the above
human and Adverse Environment operations.

The state of the Availability Management service is recorded by the following
constants and state variables:

const P: Set;                          % the set of all nodes of the system
const S: Set;                          % the set of all critical services
var N: Set-of-P init {};               % set of active nodes
var H: S → Set-of-P init λ.{};         % hosts for various services
var on: S → Boolean init false;        % on(s)=true when s ∈ S is started
var primary: S → P ∪ {⊥} init λ.⊥;     % points to node hosting primary for s
var backup: S → P ∪ {⊥} init λ.⊥;      % points to node hosting backup for s

When there exists no primary or backup for s ∈ S, the primary(s) and backup(s)
pointers have the undefined value ⊥. To avoid complications related to load balancing
and load shed, we will assume that all nodes in the system have distinct names that
are totally ordered and that when a server for s must be started, the free host with
the highest name is simply chosen to run the server for s. More realistically, the
Availability Management service will have to maintain a load variable that maps
nodes to their load. This variable will have to be periodically updated by each node
and used to select the host where a server for s must be started as the least loaded
node (if there is a tie, take the least loaded node with the highest name). Thus,
our example function for selecting the node to start a certain server among a set of
potential hosts A is simply:


select-host(A: Subset-of-P) returns P ∪ {⊥} =
    if A = {} then ⊥ else max(A) fi;
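
Both selection rules can be written directly in Python. This is a sketch, not the paper's code: the function names select_host and select_host_by_load, the use of None for the undefined value, and the `load` dictionary are assumptions made for illustration, and node names are taken to be comparable values such as strings.

```python
from typing import Dict, Optional, Set


def select_host(candidates: Set[str]) -> Optional[str]:
    """Simplified rule: the candidate with the highest name, or None
    (standing in for the undefined value) when there is no candidate."""
    return max(candidates) if candidates else None


def select_host_by_load(candidates: Set[str], load: Dict[str, float]) -> Optional[str]:
    """More realistic rule: the least loaded candidate, breaking ties in
    favor of the highest name. `load` is a hypothetical node-to-load map."""
    if not candidates:
        return None
    lightest = min(load[node] for node in candidates)
    return max(node for node in candidates if load[node] == lightest)
```

The load-based variant reduces to the simplified rule whenever all candidates carry the same load.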


The  intended  state  transitions  for  the  operations  start-service,  stop-service,  add-host, 
remove-host,  start-node  and  crash-node  exported  by  the  Availability  Management 
service  are  as  follows. 


start-service(s:S) =
    if on(s) then inform operator “s already started”
    else on(s) ← true;
         start-servers(s, N ∩ H(s));
    fi;

stop-service(s:S) =
    on(s) ← false;
    if backup(s) ≠ ⊥
    then stop server for s on backup(s);
         backup(s) ← ⊥;
    fi;
    if primary(s) ≠ ⊥
    then stop server for s on primary(s);
         primary(s) ← ⊥;
    fi;

add-hosts(h:Set-of-P, s:S) =
    H(s) ← H(s) ∪ h;
    if on(s) then start-servers(s, N ∩ H(s)) fi;

remove-hosts(h:Set-of-P, s:S) =
    H(s) ← H(s) − h;
    if backup(s) ∈ h
    then stop backup server for s on backup(s);
         backup(s) ← ⊥;
         start-backup(s, N ∩ H(s));
    fi;
    if primary(s) ∈ h
    then stop primary server on primary(s);
         primary(s) ← ⊥;
         promote-backup(s, N ∩ H(s));
    fi;

start-node(n) =
    N ← N ∪ {n};
    for all s ∈ S
    do if on(s) and n ∈ H(s)
       then start-servers(s, N ∩ H(s))
       fi;
    od;

crash-node(n) =
    N ← N − {n};
    for all s ∈ S
    do if primary(s) = n then promote-backup(s, N ∩ H(s)) fi;
       if backup(s) = n
       then backup(s) ← ⊥;
            start-backup(s, (N ∩ H(s)) − {primary(s)});
       fi;
    od;

where the start-primary, start-backup, promote-backup and start-servers state transitions
are defined as follows:

start-primary(s:S, A:Set-of-P) =
    if primary(s) = ⊥ and A ≠ {}
    then primary(s) ← select-host(A);
         start primary server for s on primary(s)
    fi;

start-backup(s:S, A:Set-of-P) =
    if backup(s) = ⊥ and A ≠ {}
    then backup(s) ← select-host(A);
         start backup server for s on backup(s)
    fi;

promote-backup(s:S, A:Set-of-P) =
    if backup(s) ≠ ⊥
    then promote backup server for s on backup(s);
         primary(s) ← backup(s);
         backup(s) ← ⊥;
         start-backup(s, A − {primary(s)});
    fi;

start-servers(s:S, A:Set-of-P) =
    start-primary(s, A);
    if primary(s) ≠ ⊥
    then start-backup(s, A − {primary(s)})
    fi;


The above state transitions are required to be performed by the Availability Management
service for as long as at least one node is running. If several events, such as failures
and restarts, occur concurrently at about the same time, then the corresponding state
transitions are required to occur in some serial order. Moreover, assuming that the start
of a primary or backup server for service s takes a bounded amount of time, each
transition is required to take a bounded amount of time. In effect, if an Availability
Management service satisfying the above requirements exists in a system, it ensures the
continuous availability of a service s to clients for as long as there exists at least
one active node that can host s, despite any number of possibly concurrent node failures
and joins.
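To make the effect of these transitions concrete, the following Python sketch models the specification as a single, centralized state machine. It is a minimal illustration under stated assumptions: the class name `AvailabilitySpec`, the `declare` helper and the use of `None` for ⊥ are hypothetical, only the primary and backup pointers are tracked (no real servers are started or stopped), and stop-service, add-hosts and remove-hosts are omitted for brevity.

```python
from typing import Dict, Optional, Set


class AvailabilitySpec:
    """Centralized model of the state transitions (None stands in for ⊥)."""

    def __init__(self) -> None:
        self.N: Set[str] = set()                      # active nodes
        self.H: Dict[str, Set[str]] = {}              # service -> potential hosts
        self.on: Dict[str, bool] = {}                 # service -> started?
        self.primary: Dict[str, Optional[str]] = {}
        self.backup: Dict[str, Optional[str]] = {}

    def declare(self, s: str, hosts: Set[str]) -> None:
        """Hypothetical helper: register service s with its potential hosts."""
        self.H[s] = set(hosts)
        self.on[s] = False
        self.primary[s] = None
        self.backup[s] = None

    @staticmethod
    def select_host(a: Set[str]) -> Optional[str]:
        return max(a) if a else None

    def start_primary(self, s: str, a: Set[str]) -> None:
        if self.primary[s] is None and a:
            self.primary[s] = self.select_host(a)

    def start_backup(self, s: str, a: Set[str]) -> None:
        if self.backup[s] is None and a:
            self.backup[s] = self.select_host(a)

    def promote_backup(self, s: str, a: Set[str]) -> None:
        if self.backup[s] is not None:
            self.primary[s] = self.backup[s]
            self.backup[s] = None
            self.start_backup(s, a - {self.primary[s]})

    def start_servers(self, s: str, a: Set[str]) -> None:
        self.start_primary(s, a)
        if self.primary[s] is not None:
            self.start_backup(s, a - {self.primary[s]})

    def start_service(self, s: str) -> None:
        if not self.on[s]:
            self.on[s] = True
            self.start_servers(s, self.N & self.H[s])

    def start_node(self, n: str) -> None:
        self.N.add(n)
        for s in self.H:
            if self.on[s] and n in self.H[s]:
                self.start_servers(s, self.N & self.H[s])

    def crash_node(self, n: str) -> None:
        self.N.discard(n)
        for s in self.H:
            if self.primary[s] == n:
                self.promote_backup(s, self.N & self.H[s])
            if self.backup[s] == n:
                self.backup[s] = None
                self.start_backup(s, (self.N & self.H[s]) - {self.primary[s]})
```

For instance, with three active hosts n1, n2 and n3 for a service, starting the service places the primary on n3 and the backup on n2; a crash of n3 then promotes n2 to primary and selects n1 as the new backup.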

5  Replicated  Design 

To ensure that the Availability Management service is itself available as long as at least
one node is active, the constants and variables introduced in the previous section are
replicated on all nodes. Thus, the team of Availability Management servers is hosted
by the set P of all¹ nodes of the system. These replicated variables are managed at
any point in time by a group of active, closely synchronized Availability Management
servers. These servers are initialized after the underlying clock synchronization, node
membership and atomic broadcast services are initialized at node startup. While
servers for these services exist at any node, the Administrator Command Interpreter
service is implemented by a server running in only one of the nodes (if that node
fails, the administrator can log in on another node). When joining the group of active
Availability Management servers, a newly started server follows the state initialization
protocol described in [Cris91], where the new server gets an older value of the service
state, monitors all updates to that past state until its reception and then applies these
updates to the old state to get an up-to-date state. We will not repeat the description
of that join protocol here. Each Availability Management server has access to the
identity of its underlying node by invoking a predefined function myid.

¹ For a hierarchical approach to Availability Management in large systems, see [CDD90].

The design to be described depends directly on the membership and atomic broadcast
services described in section 3: any update to a replicated state variable is either a
result of an atomic broadcast or of a membership change notification that appears
to the replicated Availability Managers as an atomic broadcast (for more details see
[Cris91]). Because all updates are received in the same order [CASD85], [Cris91] at
all active Availability Managers, after an Availability Manager j joins the group of
active Availability Managers, its local state variables N, H, on, primary, backup will
go through the same sequence of values as the local variables of any other Availability
Manager m that was already joined when j joined, and this will hold true until j or m
fails. Thus, when any two members of the group of active Availability Managers learn
about the same event, such as an administrator command invocation or a change
in the membership of active nodes, they have identical local states, so they reach
identical decisions about what has to be done. For example, after a failure of a node
hosting the primary server for s, all Availability Managers decide that the manager
running in the backup(s) node, which hosts the backup server for s, will have to
promote it to primary and that the manager running in the node with the highest
name in the set N ∩ H(s) − {backup(s)}, say n, will have to start a new backup for
s. So, all Availability Managers update their state according to these decisions and
they all ask themselves whether they are the ones running in the backup(s) and n
nodes by evaluating the myid = backup(s) and myid = max(N ∩ H(s) − {backup(s)})
expressions, respectively. The managers running in the nodes where these expressions
evaluate to true then do the “real” work by locally promoting the backup server for
s and by locally starting a backup for s, respectively. When more realistic select-host
functions are used instead of our simple select-host example, it is crucial that the
value of an invocation of select-host depend only upon the replicated state variables
maintained by each Availability Manager, so that any invocation yields the same
result when invoked in response to the same event at any two members of the group
of active Availability Managers.
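The “decide everywhere, act only where myid matches” pattern described above can be sketched in Python for a single service. This is an illustrative sketch only: the class `Manager`, its method names and the `local_actions` log are assumptions introduced here, the atomic broadcast and membership services are replaced by direct method calls made in the same order on every replica, and `None` stands in for ⊥.

```python
from typing import List, Optional, Set


class Manager:
    """One replicated Availability Manager for a single service (names hypothetical)."""

    def __init__(self, myid: str, hosts: Set[str], active: Set[str]) -> None:
        self.myid = myid
        self.H = set(hosts)      # potential hosts of the service
        self.N = set(active)     # currently active nodes
        self.primary: Optional[str] = None
        self.backup: Optional[str] = None
        self.local_actions: List[str] = []   # the "real" work done on this node

    def _start_backup(self, candidates: Set[str]) -> None:
        if self.backup is None and candidates:
            self.backup = max(candidates)        # same decision on every replica
            if self.backup == self.myid:         # only one replica acts on it
                self.local_actions.append("start backup")

    def on_start_service(self) -> None:
        candidates = self.N & self.H
        if self.primary is None and candidates:
            self.primary = max(candidates)
            if self.primary == self.myid:
                self.local_actions.append("start primary")
        if self.primary is not None:
            self._start_backup(candidates - {self.primary})

    def on_crash(self, n: str) -> None:
        self.N.discard(n)
        if self.primary == n and self.backup is not None:
            if self.backup == self.myid:
                self.local_actions.append("promote backup")
            self.primary, self.backup = self.backup, None
            self._start_backup((self.N & self.H) - {self.primary})
        if self.backup == n:
            self.backup = None
            self._start_backup((self.N & self.H) - {self.primary})
```

If every surviving replica is fed the same events in the same order, all replicas end with identical primary and backup pointers, while each local action appears in the log of exactly one replica.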

After completing the state initialization protocol described in [Cris91] that initializes
the N, H, on, primary, backup variables, an Availability Manager enters an infinite
loop inside which it waits for the following event types: an upcall from the membership
service that is a notification of a change in the membership of active nodes (and
hence, in the group of active Availability Managers), an upcall from the atomic broadcast
service telling about an update to the replicated state variables, or a downcall
from the Command Interpreter running in the same node, that informs the server
about a command issued by the system administrator. The code that implements
the reactions to these events is atomic with respect to synchronization (for simplicity
we do not deal explicitly with synchronization issues related to making the parallel
interpretation of these events serializable).

task Availability-Manager =

    const P, S: Set;

    var N: Set-of-P init {};

    var H: S → Set-of-P init λ.{};

    var on: S → Boolean init λ.false;

    var primary: S → P∪{⊥} init λ.⊥;

    var backup: S → P∪{⊥} init λ.⊥;

    initialize(N, H, on, primary, backup);

    loop

        when receive-from-administrator(command):
            case command of:

                start-service(s): if on(s)
                    then send-to-administrator(“already started”, s)
                    else if N ∩ H(s) = {}
                         then send-to-administrator(“no active hosts for”, s)
                         else atomically-broadcast(“start-service”, s)
                         fi;
                    fi;

                stop-service(s): if on(s)
                    then atomically-broadcast(“stop-service”, s)
                    else send-to-administrator(“already-stopped”, s)
                    fi;

                add-hosts(h,s): if h ⊆ H(s)
                    then send-to-administrator(“already-hosts”, h, s)
                    else atomically-broadcast(“add-hosts”, h, s)
                    fi;

                remove-hosts(h,s): if h ∩ H(s) ≠ {}
                    then atomically-broadcast(“remove-hosts”, h ∩ H(s), s)
                    else send-to-administrator(“not-hosts”, h, s)
                    fi;

            endcase;

        when receive-atomic-broadcast(message) from p:
            case message of:

                (“start-service”, s): on(s) ← true;
                    Start-Servers(s, N ∩ H(s));

                (“stop-service”, s): on(s) ← false;
                    if backup(s) ≠ ⊥
                    then if backup(s) = myid
                         then locally stop backup server for s
                         fi;
                         backup(s) ← ⊥;
                    fi;
                    if primary(s) ≠ ⊥
                    then if primary(s) = myid
                         then locally stop primary server for s
                         fi;
                         primary(s) ← ⊥;
                    fi;

                (“add-hosts”, h, s): H(s) ← H(s) ∪ h;
                    if on(s) then Start-Servers(s, N ∩ H(s)) fi;

                (“remove-hosts”, h, s): H(s) ← H(s) − h;
                    if backup(s) ∈ h
                    then if backup(s) = myid
                         then locally stop backup server for s
                         fi;
                         backup(s) ← ⊥;
                         Start-Backup(s, N ∩ H(s));
                    fi;
                    if primary(s) ∈ h
                    then if primary(s) = myid
                         then locally stop primary server for s
                         fi;
                         primary(s) ← ⊥;
                         Promote-Backup(s, N ∩ H(s));
                    fi;

            endcase;

        when receive-membership-notification(change, n):
            case change of:

                “join”: N ← N ∪ {n};
                    for all s ∈ S
                    do if on(s) and n ∈ H(s)
                       then Start-Servers(s, N ∩ H(s))
                       fi;
                    od;

                “crash”: N ← N − {n};
                    for all s ∈ S
                    do if primary(s) = n then Promote-Backup(s, N ∩ H(s)) fi;
                       if backup(s) = n
                       then backup(s) ← ⊥;
                            Start-Backup(s, (N ∩ H(s)) − {primary(s)});
                       fi;
                    od;

            endcase;

    endloop;


The Start-Primary, Start-Backup, Promote-Backup and Start-Servers procedures invoked
by the implementation implement in a decentralized manner the abstract state
transitions defined by start-primary, start-backup, promote-backup, and start-servers
of the previous section, respectively.


procedure Start-Primary(s:S, A:Set-of-P);
    if primary(s) = ⊥ and A ≠ {}
    then primary(s) ← select-host(A);
         if primary(s) = myid
         then locally start primary server for s
         fi;
    fi;

procedure Start-Backup(s:S, A:Set-of-P);
    if backup(s) = ⊥ and A ≠ {}
    then backup(s) ← select-host(A);
         if backup(s) = myid
         then locally start backup server for s
         fi;
    fi;

procedure Promote-Backup(s:S, A:Set-of-P);
    if backup(s) ≠ ⊥
    then if backup(s) = myid
         then locally promote backup server for s
         fi;
         primary(s) ← backup(s);
         backup(s) ← ⊥;
         Start-Backup(s, A − {primary(s)});
    fi;

procedure Start-Servers(s:S, A:Set-of-P);
    Start-Primary(s, A);
    if primary(s) ≠ ⊥
    then Start-Backup(s, A − {primary(s)})
    fi;


6  Extensions 


An extension of the above design that would deal with several services with distinct
availability policies should pose no difficulty at this point. It is sufficient for this
purpose to keep track, for each service s, of its availability policy and ensure that
it is automatically enforced along the lines of the previously sketched Availability
Management service design. Other extensions that would allow the administrator to
tailor the reaction of the Availability Management service to the particular needs of a
system installation, by letting the administrator define in a specialized high level
programming language the reaction that ought to take place in response to each type
of event for each kind of declared availability policy, are also quite straightforward. An
important extension of the simple design presented here deals with saving the state of
the Availability Management service on non-volatile storage, so as to enable a quick
restart after a total system failure (possibly due to a general power failure). This is
not a straightforward problem if one wants to solve it right. We leave it to the reader to
imagine various possible solutions and analyse their relative merits and drawbacks.

In the remainder of this section we limit ourselves to discussing possible extensions to the
case of asynchronous communication networks that do not guarantee any bound on
the time needed to communicate between two active nodes. The major difficulty in
such networks is that a process p cannot distinguish between being partitioned from
another process q and a failure of q. This leads to the possibility that the members of
a team join distinct groups at the same time and see different sequences of updates
to their local states. To prevent such divergence it is sufficient to let updates proceed
only in majority groups. It is possible to design membership and atomic broadcast
protocols that will let any two active team members that continuously join majority
groups see the same sequence of global state updates, including node restarts and
failures, despite unbounded communication delays. The main drawbacks of such a
solution are that no availability management will happen when less than a majority
of nodes are active and that there will be no bounds on the time it takes to react to events
such as administrator commands and node membership changes. Another alternative
design could be based on a leader that orders all the events happening in the system
by acting as a funnel for them. The hard problem there will be to elect a new leader
upon failure of the old leader so as to ensure that at no point in real-time there exist
two leaders that could take conflicting actions. A leader based solution will of course
share the drawbacks inherent to any solution based on asynchronous communication:
need for majority presence and no bounds on reaction times.
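The majority rule mentioned above can be illustrated with a short Python sketch. The function names are hypothetical, and the sketch shows only the counting argument (two disjoint groups of the same team cannot both hold a strict majority); it is not a membership or atomic broadcast protocol.

```python
from typing import Set


def is_majority_group(group: Set[str], team: Set[str]) -> bool:
    """True if `group` contains a strict majority of the members of `team`."""
    return len(group & team) > len(team) / 2


def may_apply_update(group: Set[str], team: Set[str]) -> bool:
    """Updates to the replicated state proceed only in majority groups."""
    return is_majority_group(group, team)
```

With a team of five nodes partitioned into groups of three and two, only the group of three may apply updates; with a team of four split evenly, neither side may, which is the availability cost noted in the text.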

EXTENDED ABSTRACT


1  Motivation 

With the increasing reliance on digital computers in embedded systems, the need for dependable systems
that deliver correct results in a timely manner has become more crucial. Large-scale embedded systems
are being built in diverse applications such as avionics, air traffic control, manufacturing, and patient
monitoring. These systems often have strict availability and timing requirements that affect one another
in subtle ways. For example, availability requirements are often enforced as timing constraints on certain
tasks in the system. Alternatively, missing a deadline on a critical task in a real-time system may result in
a system failure. As real-time fault-tolerant applications become more sophisticated, the software design
and development process has become increasingly complex. This paper argues that the traditional
approaches for providing fault-tolerance in asynchronous distributed systems are not necessarily appropriate
for time-critical applications.

The  motivation  for  this  work  is  based  on  two  observations: 

1. the characterization of design methodologies for fault-tolerant systems based on redundancy in space
or redundancy in time is inadequate for real-time systems; and

2. establishing a global consistent system state based on the causal order of messages among cooperating
processes does not consider the temporal consistency requirements imposed on the data in a system.


2  System  State 

Fault-tolerance can be defined informally as the ability of a system to provide a service in a timely manner
even in the presence of failures. A common approach for building fault-tolerant systems is to replicate
servers that fail independently. The main strategies for structuring fault-tolerant servers are passive and
active replication. In passive replication schemes [4, 1], the system state is maintained by a primary and one
or more backup processes. The primary checkpoints its local state to the backups such that a backup can
take over upon detecting a failure of the primary. In active replication schemes [6, 13, 2], also known as the
state machine approach, a collection of identical server processes maintain replicated copies of the system
state. Updates are applied atomically to all the replicas such that after detecting the failure of a process, the
remaining processes continue the service.

In a distributed environment, one can view the system as a collection of cooperating processes. The
global system state is the aggregate set of the local states of the cooperating processes. A widely-studied
approach is to establish consistent global system states as a computation progresses and to roll back to an
earlier system state when a failure is detected. A consistent global system state is defined to be a state that is
reachable from the initial state. Numerous checkpointing/logging-based schemes for establishing a global
system state in a distributed environment have been proposed in the past, e.g., [3, 8, 14]. In these approaches,
each process checkpoints its state locally, and the messages between processes are logged synchronously
or asynchronously. Upon detecting a failure, a global system state is established by a rollback to an earlier
point in the computation that could have been reachable from the initial system state.

It has been argued in the past that the real-time and fault-tolerance requirements of a system are not
orthogonal. The difficulty in applying traditional approaches for providing fault-tolerance to real-time
systems is that time is not explicitly considered in defining a consistent system state. In particular, several
distinguishing characteristics of real-time systems must be considered:

1. Timing Constraints: the correctness of a computation is dependent not only on the correctness of its
results, but also on meeting stringent timing requirements.

2. Perishable Data: the data in these systems are perishable in the sense that the usefulness of a data
item decreases with the passage of time.

3.  Weaker  Consistency  Constraints:  the  semantics  of  real-time  data  allows  the  exploitation  of  weaker 
consistency  requirements  than  the  causal  or  total  order  on  the  operations  in  a  system. 

4.  Redundancy  in  Data  Semantics:  the  characterization  of  design  methodologies  based  on  redundancy 
in  space  or  redundancy  in  time  is  inadequate  since  the  semantics  of  certain  data  items  allows  a  stale 
or  approximate  data  value  to  be  used. 

Before presenting alternative models for specifying consistent system states of real-time processes, we
will elaborate on the above points. We use two examples to illustrate why a different definition of a system
state is more appropriate for real-time systems.

Whether the primary motivation for fault-tolerance is to ensure data integrity or to mask failures at
run-time, the notion of a consistent system state after a failure defines the correctness criteria for different
approaches. The precise definition of a consistent state in a real-time system is complicated by one crucial
factor: a system state in time-critical applications changes with the passage of time. This is a key difference
from asynchronous systems in which time is not considered explicitly in defining a system state. This
has several important implications. First, redundancy management must be predictable; meeting stringent
timing constraints and achieving fault-tolerance requirements may be contradictory goals in some cases.
Second, restoring a system state (by rolling backward or forward) must satisfy certain timing properties
imposed on the data in the system. Since the usefulness of real-time data diminishes with the passage of time,
the definition of a consistent system state must include the temporal relationship between data objects.
Third, the ordering constraints, such as Lamport’s happened-before relation or the total order guaranteed by
atomic multicast, may be weakened in managing replicated data in real-time systems. Finally, the inherent
non-determinism of time-critical systems can be exploited in developing new fault-tolerance strategies. The
imprecise computation technique, for example, exploits data semantics in obtaining timely but lesser quality
results in iteratively improving calculations.

A real-time system may fail to function correctly either because of errors in its hardware and/or
software or because of not responding in time to meet the timing requirements that are usually imposed by
its “environment.” Hence, a real-time system can be viewed as one that must deliver the expected service
in a timely manner even in the presence of faults. A missed deadline can be potentially as disastrous as a
system crash or an incorrect behavior of a critical task, e.g., a digital control system may lose stability. Since
the logical correctness of a system may be dependent on the timing correctness of other components, the
task of separating logical correctness from timing correctness may be very difficult. Furthermore, timeliness
and fault-tolerance requirements could pull each other in opposite directions. For example, checkpointing a
system state and complex recovery mechanisms will enhance fault-tolerance but may increase the probability
of missing a deadline. Hence, one must explicitly consider timing requirements when defining a consistent
system state. The following example illustrates this point.

Example  1:  Airspace  Control 

Consider an airplane that is moving from airspace A to an adjacent airspace B.¹ Different air traffic
controllers are responsible for each airspace. As the airplane is moving from airspace A to airspace B, the
control must be passed from one controller system to the other. Two data objects, O_A and O_B, reflect which
controller is responsible for the airplane. If O_A = 1, controller A is in charge of the airplane. If O_A = 0,
controller A is not responsible for the airplane. Initially, O_A = 1 ∧ O_B = 0. The hand-off must take
place as the airplane is moving from one airspace to the other. A safety property of the system is that there
should be a maximum time interval of 500ms during which both data objects are zero, i.e., neither controller
is responsible for the airplane. Suppose a process P1 updates both O_A and O_B, and P2 and P3 are the
displaying processes for controllers A and B respectively. We use the notation w(O) and r(O) to denote a
write operation and a read operation to an object O, respectively.

    P1: w(O_A), w(O_B)

    P2: r(O_A), w(display_A)

    P3: r(O_B), w(display_B)

If the above safety property (or time constraint) is not imposed on the system, any interleaved execution
of the above operations is acceptable. Consider the following execution sequence:

    w(O_A), r(O_A), r(O_B), w(O_B), ...

If the two reads are separated by more than 500ms, the safety property is violated. Thus, the correct relative
ordering of operations does not necessarily ensure temporal correctness. Hence, other constraints must be
imposed to ensure this performance requirement. In this example, all operations in P1 must be performed
within 500ms. □
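The safety property of this example can be checked mechanically against a timestamped trace of the writes performed by P1. The following Python sketch is an illustration only: the trace format, the function names and the `end_ms` parameter are assumptions introduced here, while the initial state and the 500ms bound are taken from the example.

```python
from typing import Dict, List, Optional, Tuple

Write = Tuple[float, str, int]   # (time in ms, object name, new value)


def max_unowned_interval(writes: List[Write], end_ms: float) -> float:
    """Longest span (ms) during which both O_A and O_B are 0, observed up to end_ms."""
    state: Dict[str, int] = {"O_A": 1, "O_B": 0}   # initial state from the example
    longest = 0.0
    gap_start: Optional[float] = None
    for t, obj, val in sorted(writes):
        state[obj] = val
        unowned = state["O_A"] == 0 and state["O_B"] == 0
        if unowned and gap_start is None:
            gap_start = t                          # neither controller is responsible
        elif not unowned and gap_start is not None:
            longest = max(longest, t - gap_start)  # responsibility re-established
            gap_start = None
    if gap_start is not None:                      # still unowned at end of trace
        longest = max(longest, end_ms - gap_start)
    return longest


def hand_off_is_safe(writes: List[Write], end_ms: float, bound_ms: float = 500.0) -> bool:
    """True if the airplane is never without a controller for more than bound_ms."""
    return max_unowned_interval(writes, end_ms) <= bound_ms
```

Two traces with the same relative order of writes can differ in safety: clearing O_A at time 0 and setting O_B at 300ms is acceptable, whereas setting O_B at 800ms is not.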

Fault-tolerance techniques based on checkpointing and message logging ensure that after a failure, a
distributed computation recovers to a global state which is reachable from its initial state. There are several
problems in applying this approach to real-time systems. First, since a real-time process may include time
explicitly in its local state, the definition of a consistent global system state based on partial (or causal)
ordering of messages may not be appropriate. For example, consider the data repository in a flight control
system. The decision to delay or to land an aircraft is based on the “current” position and status of the
aircraft being monitored. The system state in the data repository is consistent if the timestamps of different
data items representing aircraft positions are within an acceptable tolerance. Second, restoring the system to
an earlier state may be unnecessary or even incorrect in these applications. In certain cases, a process may
resume its execution at a predefined state and obtain its input directly from a sensor after a failure. Third,
since a timing constraint may be imposed on the execution of a process (or a collection of processes), a
complex recovery mechanism and resuming execution in an earlier state may result in missing the deadline.

¹ This is a variation of an example in [10].

Example 2: Distributed Computation

Figure 1 illustrates three cooperating processes P1, P2 and P3 with the corresponding checkpoints
S1, S2 and S3. The message labeled a from process P1 to P2 crosses the recovery line established by
the checkpoints. In an asynchronous environment, to establish a consistent cut, message a is logged
synchronously or asynchronously by the sender or the receiver. If the processes in Figure 1 are real-time
processes, several other alternatives for establishing a consistent system state may be possible. If the state
variable v updated by a can be extrapolated from its previous values, then it may be unnecessary to log
the message to establish a consistent system state. Alternatively, if process P2 is a periodic process and
the previous value of v (prior to the checkpoint at S2) is within a predefined distance from the new value
of v, this previous value of v can be used in case the process suffers a failure after the checkpoint S2 and
before the subsequent checkpoint. Another alternative may be to take checkpoints based on absolute time
if the processor clocks are synchronized within a known value ε. For example, as shown in Figure 2, if each
process takes a checkpoint at the local time T, then a recovery interval [T − ε, T + ε] can be established.
This recovery interval can be used to define a consistent global state to which the system can be restored
after a failure.

Figure 2: A Recovery Interval
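The recovery interval can be illustrated with a small Python sketch. It assumes, as one possible reading of “synchronized within ε”, that each local clock deviates from a common reference time by at most ε; the function names, the per-process `offsets` map (local time minus reference time) and the use of integer milliseconds are assumptions made for the illustration.

```python
from typing import Dict, Tuple


def recovery_interval(T: int, eps: int) -> Tuple[int, int]:
    """Reference-time interval [T - eps, T + eps] for checkpoints taken at local time T."""
    return (T - eps, T + eps)


def checkpoint_reference_times(T: int, offsets: Dict[str, int]) -> Dict[str, int]:
    """Reference time at which each process's local clock reads T.

    local = reference + offset, hence reference = T - offset.
    """
    return {p: T - off for p, off in offsets.items()}


def all_within_recovery_interval(T: int, eps: int, offsets: Dict[str, int]) -> bool:
    """True if every checkpoint taken at local time T falls inside the recovery interval."""
    lo, hi = recovery_interval(T, eps)
    return all(lo <= t <= hi for t in checkpoint_reference_times(T, offsets).values())
```

As long as every clock offset stays within ε, all checkpoints taken at local time T fall inside the interval; a clock that drifts beyond ε produces a checkpoint outside it.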


3  Real-Time  Models  to  Support  Fault-Tolerance 

In this section, we propose two models for defining a system state in a real-time computation. The models
introduce two complementary approaches to providing fault-tolerance in real-time systems. Both approaches
incorporate the notion of time into the definition of a consistent system state. Before presenting the models,
we briefly examine a classification of data dependencies in a real-time computation:

• Temporal Dependency: A collection of data items whose timestamps must be within a predefined
value ε of each other in a system state [10].

• Value Dependency: A collection of data items whose values must be within a predefined tolerance δ
in a system state.

• Causal Dependency: Existence of a data item is dependent on the existence of another data item.

3.1  Server  State 

In  a  real-time  system,  it  is  important  to  use  the  values  of  data  objects  that  have  existed  at  approximately  the 
same time. For example, an air traffic controller monitoring the positions of several aircraft must view the
coordinates  that  are  taken  within  a  very  short  interval.  Hence,  a  set  of  temporal  constraints  must  be  enforced 
on  the  data  objects  in  a  system.  These  temporal  constraints  must  be  considered  when  defining  a  consistent 
system  state  in  a  real-time  environment.  Consequently,  a  system  state  restored  after  a  failure  must  satisfy 
these  temporal  constraints.  A  crucial  observation  is  that  the  decision  on  when  to  update  a  backup  copy  is 
often  driven  by  the  staleness  of  the  data  object  rather  than  the  relative  (causal)  order  of  message  exchanges 
between  processes. 

In this model, a real-time system is seen as a collection of services provided in a timely and dependable
manner. Each service is provided by a set of replicated servers running on multiple processors. (The
replication strategy for each service can be based on active or passive replication.) We define the state
of a system as the collective states of the underlying servers. A server state, in turn, is defined by the
set of objects (including the values and timestamps) internal to that server. This is where we depart from
traditional approaches to defining a system state. A consistent state of a server is defined by the set of the
temporal and value constraints imposed on the objects in the server. In other words, a consistent server state
is the temporally correct snapshot of the objects maintained by the server. In an automated process control
system, for example, the algorithms to monitor and control an external device are executed periodically.
The result from the execution of an algorithm can be updated on a backup server to tolerate a failure of the
primary. The state of this server is the set of input and output values during each iteration of the
program execution. This is the state that must be restored (in a timely fashion) if the primary server fails.

Figure 3 illustrates a server state. The objects in a server are denoted by circles; the constraints imposed
on  collections  of  objects  are  denoted  by  squares.  An  update  to  an  object  or  the  passage  of  time  must  preserve 
the  constraints  imposed  on  the  objects.  A  definition  of  a  server  state  based  on  the  notions  of  temporal  and 
value  dependencies  is  given  below. 

Definition: 

A server state is defined to be an ordered collection of objects $(O_1, O_2, \ldots, O_n)$ with $v(O_i)$ and $t(O_i)$
denoting the value and timestamp of object $O_i$.
Suppose the current time is denoted by $T$. A server state is consistent if the following set of constraints is satisfied:

$$\forall i,j: \quad |t(O_i) - t(O_j)| \le \delta_{ij}(T)$$

$$\forall i,j: \quad |v(O_i) - v(O_j)| \le \epsilon_{ij}(T)$$

The above definition requires a consistent system state to satisfy a set of temporal and value dependency
constraints. These constraints are explicitly imposed on pairs of objects that together form a server
state. After detecting a failure, the server must be restored to a state that satisfies the temporal and value
dependencies. The parameter $T$ indicates that the temporal bound $\delta$ and the value bound $\epsilon$ are not
necessarily constants. For example, the value of $\epsilon$ can itself be a function of the distance between the
timestamps of the pair of objects and the length of time since the minimum of the two timestamps.
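The pairwise check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `DataObject`, `is_consistent`, `delta`, and `epsilon` are hypothetical, and the two bound functions are assumed to take the object indices and the current time, since the text says only that the bounds need not be constants.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, Sequence


@dataclass(frozen=True)
class DataObject:
    """One object in a server state: a value and the time it was written."""
    value: float
    timestamp: float


# Hypothetical signature for a bound: (i, j, current_time) -> allowed distance.
Bound = Callable[[int, int, float], float]


def is_consistent(objects: Sequence[DataObject],
                  delta: Bound,
                  epsilon: Bound,
                  now: float) -> bool:
    """Return True if every pair of objects meets both pairwise constraints."""
    for i, j in combinations(range(len(objects)), 2):
        a, b = objects[i], objects[j]
        # Temporal constraint: the two timestamps must be close together.
        if abs(a.timestamp - b.timestamp) > delta(i, j, now):
            return False
        # Value constraint: the two values must be close together.
        if abs(a.value - b.value) > epsilon(i, j, now):
            return False
    return True


fresh = [DataObject(10.0, 100.0), DataObject(10.5, 100.2)]
stale = [DataObject(10.0, 100.0), DataObject(10.5, 103.0)]
constant_delta = lambda i, j, now: 0.5    # assumed constant bounds for the demo
constant_epsilon = lambda i, j, now: 1.0
print(is_consistent(fresh, constant_delta, constant_epsilon, now=101.0))
print(is_consistent(stale, constant_delta, constant_epsilon, now=104.0))
```

A recovery routine could run such a check on a candidate restored state before resuming service. Passing the bounds as functions leaves room for bounds that tighten or loosen with the current time.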

The above definition for a consistent server state considers only the last update to an object, i.e., the most
recent value and timestamp for an object. In certain cases, a bounded history of recent updates to an object
can be exploited to define a consistent server state. For example, consider a task that periodically updates a
state variable based on the calculation in the previous invocation of the task and the reading of a new value
from a sensor. A backup processor will execute this task periodically if the primary fails. It is unnecessary
to send the update from each iteration to the backup if an approximate value can be extrapolated from earlier
iterations that have already been logged on the backup. The above definition for a consistent state can be
extended to include semantic information about the history of recent updates to an object. Suppose $O_i[-k]$
denotes the $k$-th most recent update to an object $O_i$. The following constraint imposes lower and upper
bounds on the value of the last update based on the $k$ most recent updates:

$$A(k,T) \le v(O_i[-1]) \le B(k,T)$$

The functions $A$ and $B$ establish the lower and upper bounds on an acceptable value of $O_i$ that can be restored
after a failure. The following simple constraint imposes lower and upper bounds on the difference between
the last two updates:

$$A(T) \le |v(O_i[-1]) - v(O_i[-2])| \le B(T)$$

If the second most recent update to $O_i$ has been sent to a backup processor, the above constraint can be used
to determine whether the most recent update must be sent to the backup processor as well.
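One way to read the last sentence is as a send/skip decision on the primary. The sketch below assumes, as an interpretation not stated in the text, that an update may be skipped when its distance from the value already logged on the backup lies within the lower and upper bounds. The function and parameter names are hypothetical.

```python
def must_send_update(latest: float, logged: float,
                     lower: float, upper: float) -> bool:
    """Decide whether the most recent update has to be forwarded to the backup.

    `logged` is the second most recent value, assumed to be on the backup.
    `lower` and `upper` stand in for the bounds on the difference between
    the last two updates, evaluated at the current time.
    """
    difference = abs(latest - logged)
    # Inside the bounds: the backup copy is taken to be an acceptable
    # approximation, so the update can be skipped (assumed interpretation).
    return not (lower <= difference <= upper)


print(must_send_update(latest=10.2, logged=10.0, lower=0.0, upper=0.5))
print(must_send_update(latest=12.0, logged=10.0, lower=0.0, upper=0.5))
```

Under this reading, a slowly varying state variable generates backup traffic only when it moves outside the tolerated band.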

3.2 State of Cooperating Processes

In this model, a real-time system is seen as a set of cooperating (perhaps periodic) processes that communicate
by exchanging messages. Processes go through internal state transitions due to internal computation, receipt
of a message, or passage of time. The global system state can be characterized by the set of local process
states and the messages that have been sent by the sender but not applied to the local state. Instead of
viewing a global state as a consistent cut of states of processes at some logical instant of time, as in
[3, 5, 8, 9, 12, 14], we define a system state at an instant of real time in the same spirit as in [7, 11, 1*']. We
assume that the processor clocks in the system are approximately synchronized within a known deviation.

Definition: 

Suppose $P$ is a real-time distributed system consisting of a set of $n$ processes $p_1, p_2, \ldots, p_n$. The global
system state at time $t$, denoted by $S(t)$, is defined as

a) $\langle s_1(t), s_2(t), \ldots, s_n(t) \rangle$, such that $s_i(t)$ is the local state of the process $p_i$ at time $t_i$, where
$t_i = \min(t, r_i)$ and $r_i$ is the time of the receipt of the first message by $p_i$ that was sent by another process $p_j$
after $p_j$ reached its local state $s_j(t)$.

b) $m_{i,j}(t)$ for all $1 \le i, j \le n$, such that $m_{i,j}(t)$ is the set of the messages from $p_i$ to $p_j$ timestamped by $p_i$
before its local time $t_i$ and not received by $p_j$ until after its local state $s_j(t)$.

Part (a) in the above definition refers to the collection of local process states and part (b) covers the logical
message queue between each pair of processes.

The above definition establishes a recovery interval to which a system can be restored after a failure.
However, it still attempts to preserve the causal order when establishing a consistent system state. It is
possible to relax the above definition such that a weaker notion of a system state can be obtained. One
possible approach is to enforce constraints similar to those in Section 3.1 to determine whether an update (a
message) from a sender should be logged during the recovery interval.

4  Concluding  Remarks 

As  embedded  real-time  systems  become  more  sophisticated,  the  ability  of  the  system  to  provide  dependable 
and timely service becomes critical. The focus of this extended abstract was on alternative models for
defining a system state in a real-time computation. It was argued that the traditional approaches to
fault-tolerance in asynchronous systems are not suitable for a real-time environment. Recovering a system to a
consistent state after a failure must consider the timing requirements imposed on the system. The two models
that were presented in this abstract consider time explicitly in defining a consistent system state. In one
model, a system state consists of a collection of temporally related objects. A system failure would require
restoration of the system to a state in which the temporal and value dependency requirements among objects
are satisfied. In the second model, a system state is defined at an absolute time. Due to the approximately
synchronized processor clocks, a recovery interval is established to which a system can be restored after a
system failure.

Abstract


In  fault-tolerant  distributed  systems,  different  non-faulty  processes  may  arrive  at  different 
values  for  a  given  system  parameter.  To  resolve  this  disagreement,  processes  must  exchange 
and vote upon their respective local values. During voting, faulty processes may attempt
to inhibit agreement by acting in a malicious or “Byzantine” manner. Approximate Agreement
defines a form of agreement in which the voted values obtained by the non-faulty
processes need only agree to within a predefined tolerance. Approximate Agreement can
be achieved by a sequence of convergent voting rounds, in which the range of values held
by non-faulty processes is reduced in each round. Existing convergent voting algorithms
assume complete connectivity between processes. Where the physical connectivity is incomplete,
messages must be relayed between processors to simulate complete connectivity.
For  large,  sparsely  connected  systems,  the  message  traffic  associated  with  message  relaying 
could  be  prohibitive,  making  Approximate  Agreement  infeasible  for  such  systems. 

This  paper  presents  a  means  of  implementing  convergent  voting  in  large  sparsely  connected 
networks  without  the  massive  communication  overhead  incurred  by  the  global  relaying  of 
messages.  Simple  expressions  are  presented  for  the  convergence  rates  and  robustness  of 
a  broad  family  of  low-overhead  locally  convergent  voting  algorithms  in  the  simultaneous 
presence of multiple fault modes. These expressions are employed to determine the robustness
of local convergence in some commonly used partially connected networks. Issues
affecting  global  convergence  are  also  addressed,  and  the  extension  of  the  results  to  several 
related problems is discussed.

1 Introduction


As  distributed  computing  systems  become  larger,  there  is  a  growing  conflict  between  the 
increasing  need  for  fault-tolerance  and  the  computational  overhead  required  by  fault- 
tolerance  mechanisms.  Even  in  systems  containing  less  than  a  dozen  processing  nodes, 
fault-tolerance  overhead  can  consume  a  majority  of  the  processing  power  [Pal85,  Cze85]. 
In contrast, systems are now being built with hundreds or thousands of nodes. This paper
presents a method of adapting one important fault-tolerance mechanism to very large
systems while maintaining the overhead of a small system.

An  important  issue  in  distributed  fault-tolerance  is  ensuring  that  all  non-faulty  processes 
agree  on  the  values  of  critical  data  items  despite  active  interference  from  faulty  processes. 
This  issue  arises  whenever  non-faulty  processes  can  legitimately  form  differing  “opinions” 
regarding  a  specific  value.  They  must  then  exchange  and  vote  upon  their  local  values  to 
arrive  at  a  single  consensus  value.  If  a  faulty  process  is  constrained  to  send  the  same 
erroneous  value  to  all  non-faulty  processes,  then  simple  majority  voting  is  sufficient  to 
provide  immediate  agreement.  It  is  only  necessary  that  the  majority  of  the  processes  be 
non-faulty.  Reaching  agreement  becomes  significantly  more  difficult  if  a  faulty  process  is 
permitted to send conflicting values to different non-faulty processes. A faulty process
with  this  property  has  been  called  malicious,  two-faced,  Byzantine,  or  asymmetric. 

The  classic  form  of  distributed  agreement,  Byzantine  Agreement,  requires  that  all  non- 
faulty  processes  obtain  identical  voted  values  for  any  set  of  initial  values.  However,  many 
applications  do  not  require  non-faulty  processes  to  achieve  exact  agreement.  Rather,  they 
need  only  agree  on  a  value  to  within  a  specified  tolerance.  This  state,  called  Approximate 
Agreement,  is  useful  in  areas  such  as  sensor  data  management  and  fault-tolerant  clock 
synchronization [Kie88, Lam85, Lun84, Sch87, Tha89]. Given an arbitrarily small positive
real value $\epsilon$, Approximate Agreement is defined by two conditions [Dol83, Dol86]:

Agreement  —  The  voting  algorithms  executed  by  all  non-faulty  processes  eventually  halt 
with voted values that are within $\epsilon$ of each other.

Validity  —  The  voted  value  held  by  each  non-faulty  process  is  within  the  range  of  the 
initial  values  held  by  all  non-faulty  processes. 
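The two conditions can be written as simple predicates over the values held by the non-faulty processes. The sketch below is illustrative only: the function names are hypothetical, and "within the tolerance" is assumed to mean a spread less than or equal to the tolerance.

```python
from typing import Sequence


def satisfies_agreement(voted: Sequence[float], tolerance: float) -> bool:
    """Agreement: all voted values lie within `tolerance` of each other."""
    return max(voted) - min(voted) <= tolerance


def satisfies_validity(voted: Sequence[float],
                       initial: Sequence[float]) -> bool:
    """Validity: every voted value lies in the range of the initial values."""
    low, high = min(initial), max(initial)
    return all(low <= v <= high for v in voted)


initial_values = [0.9, 1.1, 1.0]      # held by non-faulty processes
voted_values = [1.00, 1.02, 1.01]     # after some number of voting rounds
print(satisfies_agreement(voted_values, tolerance=0.05))
print(satisfies_validity(voted_values, initial_values))
```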

Most  Approximate  Agreement  algorithms  employ  multiple  rounds  of  message  exchange 
interleaved with a convergent voting algorithm which guarantees that the range of values
held by the non-faulty processes is reduced in each round [Dol83, Dol86, Kie91, Lam85,
Vas89]. This property, called single-step convergence, guarantees that the range of values
will eventually be less than $\epsilon$, given enough rounds.

Large distributed systems rely upon partially connected networks for interprocess communication
[Fen81, Hwa84]. However, convergent voting algorithms have been derived only
for systems which are completely connected. In systems with partial physical interconnections,
the voting processes must relay messages such that the system connectivity is
logically complete. If the total number of processes is large, the global exchange of local
values can consume a great deal of time and communication resources. As a result,
convergent  voting  algorithms  are  not  practical  in  large  sparsely  connected  systems. 

Most  analyses  of  convergent  voting  assume  that  all  faults  exhibit  asymmetric  or  Byzantine 
behavior [Dol83, Dol86, Lam85]. In reality, asymmetry occurs only under complex and
improbable conditions. Thus, if coincident faults occur, it is highly unlikely that all faults
will be Byzantine in nature. Recently, the behavior of convergent voting algorithms has
been analyzed in the simultaneous presence of three distinct modes of faults: asymmetric
(Byzantine), symmetric, and benign (self-incriminating) [Kie91]. This analysis showed
that  convergent  voting  can  be  significantly  more  robust  than  predicted  by  the  single-mode 
Byzantine  fault  model. 

This  paper  presents  a  means  for  limiting  the  overhead  of  achieving  Approximate  Agreement 
in  large  sparsely  connected  systems.  The  general  approach  is  to  prohibit  the  relay  of 
convergent  voting  messages.  Thus,  each  processor  performs  convergent  voting  only  with 
its  immediate  neighbors.  The  objectives  are:  (1)  to  present  low-overhead  convergent  voting 
algorithms  which  function  without  message  relay,  (2)  to  analyze  the  convergence  rates  of 
these  algorithms  using  a  mixed-mode  fault  model,  (3)  to  determine  the  theoretical  bounds 
on  their  fault-tolerance  as  a  function  of  the  topology  and  connectivity  of  the  network. 

These  results  make  Approximate  Agreement  feasible  for  very  large  distributed  systems. 
They  thus  facilitate  confident  design  and  verification  of  distributed  processes  such  as  clock 
synchronization and redundant sensor management. In addition, the methodology employed
is shown to be extendable to several related problems.

2 Background


Distributed systems have been partitioned into two distinct classes, synchronous and asynchronous
systems [Dol83]. In a synchronous system there are finite bounds on the processing
and communication delays of non-faulty processes. There is thus a point in time
by  which  any  process  executing  a  convergent  voting  algorithm  will  have  received  all  data 
from  all  non-faulty  processes.  Any  data  arriving  after  that  time  must  have  come  from  a 
faulty  process.  In  an  asynchronous  system,  no  finite  bounds  on  process  operation  exist,  so 
that  a  process  might  have  to  wait  “forever”  to  receive  data  from  all  non-faulty  processes. 
It  is  thus  impossible  to  differentiate  between  a  slow  non-faulty  process  and  a  “dead”  faulty 
process. 

This  paper  addresses  synchronous  systems  only.  The  synchronous  system  model  is  most 
representative of real-time systems, and is applicable to both data voting and clock synchronization.
In addition, synchronous systems require the participation of fewer processes
than asynchronous systems [Dol83, Dol86]. This fact can be of great importance in sparsely
connected systems, where communication is constrained. A similar approach can be applied
to asynchronous systems to produce results analogous to those presented herein.

Most of the previous research in convergent voting assumed a completely connected network
of processors. In systems with partial interconnection, messages must be relayed
such that each non-faulty process receives a value from every other non-faulty process.
Two approaches have been taken to reduce the overhead of global message exchange in
partially connected systems. The first approach considers the system to be hierarchically
composed of processor clusters [Shi87]. Within each cluster, all processors are completely
connected. One processor in each cluster is also connected to one processor in another
cluster such that the set of clusters is completely connected. It is then possible to set
one tolerance on agreement within a cluster, and a looser tolerance on agreement between
clusters. The second approach employs special purpose communication hardware to increase
the efficiency of handling relayed messages [Ram90]. However, this approach does
not reduce the overall complexity of the message traffic.

Several  convergent  voting  algorithms  have  been  derived,  such  as  the  Fault-Tolerant  Mean 
[Dol83],  the  Fault-Tolerant  Midpoint  (Mean  of  Medial  Extremes)  [Dol83],  the  Interactive 
Convergence algorithm [Lam85], and Dolev’s Optimal algorithm [Dol86]. Each algorithm
required ad-hoc proofs of its fault-tolerance and convergence properties. Furthermore,
these analyses were only valid under a single-mode Byzantine fault model.

In recent work, a large family of voting algorithms was defined, called Mean-Subsequence-Reduced
(MSR) algorithms [Kie91]. The MSR family encompasses several of the previously
known  voting  algorithms.  Simple  expressions  have  been  derived  for  the  fault-tolerance  and 
convergence  properties  of  any  MSR  algorithm  under  a  mixed-mode  fault-model.  The  work 
presented  here  extends  this  analysis  to  partially  connected  systems  with  no  relay  of  voting 
messages.  We  begin  with  some  necessary  definitions  and  background. 

2.1 Real-Valued Multisets

Approximate  Agreement  requires  the  manipulation  of  multisets  of  real  values.  A  multiset 
is  a  collection  of  objects  similar  to  a  set.  However,  it  differs  from  a  set  in  that  the  elements 
of  a  multiset  need  not  be  distinct.  For  example,  a  set  of  real  numbers  contains  no  more 
than  one  occurrence  of  any  given  value,  while  a  multiset  of  real  numbers  may  contain 
multiple  occurrences  of  the  same  value.  The  number  of  times  a  particular  object  (value) 
appears in a multiset is called the Multiplicity of that object. A finite multiset $\mathbf{V}$ of real
values may be represented as a mapping $V : \mathbb{R} \to \mathbb{N}$. For each real value $r$, $V(r)$ is defined
as the multiplicity of $r$ in $\mathbf{V}$. The size of $\mathbf{V}$ is $V = |\mathbf{V}| = \sum_{r \in \mathbb{R}} V(r)$.

An alternative representation for a multiset of real numbers is a monotonically increasing
sequence of the real values of its elements, i.e. $\mathbf{V} = \langle v_1, \ldots, v_V \rangle$ ordered such that
$v_i \le v_{i+1} \; \forall \, i \in \{1, \ldots, V - 1\}$ [And63, Liu85]. Both representations of a multiset are
equivalent, but for certain operations one form or the other is more convenient. To avoid
confusion, we use upper-case symbols for multiplicities in the real-to-integer mapping form,
e.g. $V(r)$. Similarly, we use angle-braces and lower-case symbols for elements in the
sequence form, e.g. $\mathbf{V} = \langle v_1, \ldots, v_V \rangle = \langle v_i \rangle \; \forall \, i \in \{1, \ldots, V\}$.

Real-Valued  Parameters  —  A  multiset  of  real  numbers  has  several  useful  real-valued 
parameters. 

$\min(\mathbf{V}) = \min(r \in \mathbb{R} : V(r) > 0) = v_1$; the minimum value of the elements in $\mathbf{V}$.

$\max(\mathbf{V}) = \max(r \in \mathbb{R} : V(r) > 0) = v_V$; the maximum value of the elements in $\mathbf{V}$.

$\rho(\mathbf{V}) = [\min(\mathbf{V}), \max(\mathbf{V})] = [v_1, v_V]$; the real interval spanned by $\mathbf{V}$. $\rho(\mathbf{V})$ is
called the range of $\mathbf{V}$.

$\delta(\mathbf{V}) = \max(\mathbf{V}) - \min(\mathbf{V}) = v_V - v_1$; the difference between the maximum and
minimum values of $\mathbf{V}$. $\delta(\mathbf{V})$ is called the diameter of $\mathbf{V}$.

$\mathrm{mean}(\mathbf{V})$ = The arithmetic mean of the real values of all elements of $\mathbf{V}$;

$$\mathrm{mean}(\mathbf{V}) = \frac{1}{V} \sum_{i=1}^{V} v_i$$


Multiset  Relations  —  Two  multisets  U  and  V  may  be  related  by: 

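Union: Let $\mathbf{W} = \mathbf{V} \cup \mathbf{U}$. Then $W(r) = \max[V(r), U(r)] \; \forall \, r \in \mathbb{R}$.
Intersection: Let $\mathbf{W} = \mathbf{V} \cap \mathbf{U}$. Then $W(r) = \min[V(r), U(r)] \; \forall \, r \in \mathbb{R}$.
Sum: Let $\mathbf{W} = \mathbf{V} + \mathbf{U}$. Then $W(r) = V(r) + U(r) \; \forall \, r \in \mathbb{R}$.

Difference: Let $\mathbf{W} = \mathbf{V} - \mathbf{U}$. Then $\forall \, r \in \mathbb{R}$:

$$W(r) = \begin{cases} V(r) - U(r) & \text{if } V(r) > U(r) \\ 0 & \text{otherwise} \end{cases}$$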
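Python's standard `collections.Counter` stores a multiset in the multiplicity-mapping form, and its `|`, `&`, `+`, and `-` operators follow the same max, min, sum, and truncated-difference rules as the four relations above. The short sketch below uses it to illustrate the relations and the real-valued parameters on two small example multisets. The example values are arbitrary.

```python
from collections import Counter

# Multiplicity-mapping form: value -> number of occurrences.
V = Counter([1.0, 1.0, 2.0, 3.0])
U = Counter([1.0, 2.0, 2.0, 4.0])

union = V | U            # W(r) = max[V(r), U(r)]
intersection = V & U     # W(r) = min[V(r), U(r)]
total = V + U            # W(r) = V(r) + U(r)
difference = V - U       # W(r) = V(r) - U(r) if positive, else 0

# Sequence form: monotonically increasing list of element values.
sequence = sorted(V.elements())
size = len(sequence)                      # |V|
diameter = sequence[-1] - sequence[0]     # max(V) - min(V)
mean = sum(sequence) / size               # arithmetic mean of all elements

print(dict(union), dict(intersection), dict(total), dict(difference))
print(sequence, size, diameter, mean)
```

Moving between the two representations is a one-line conversion in each direction (`sorted(V.elements())` and `Counter(sequence)`), which mirrors the remark that either form may be used where convenient.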


Subsequences  -  Given  two  sequences  U  and  V,  U  is  a  subsequence  of  V  if  all  elements 
of  U  are  selected  from  the  elements  of  V,  and  arranged  in  the  same  order  as  their  relative 
order  in  V.  While  a  subsequence  is  also  a  submultiset,  it  has  the  important  property  that 
the  index  of  an  element  in  V  is  the  sole  criterion  for  its  inclusion  in  U.  Thus,  the  function 
which  selects  elements  of  V  to  be  included  in  U  is  a  mapping  from  the  indices  of  U  to  the 
indices of $\mathbf{V}$.

Formally, let $I_V = \{1, \ldots, V\}$ be the set of indices for multiset $\mathbf{V}$, and let $I_U = \{1, \ldots, U\}$
be the set of indices for multiset $\mathbf{U}$. Then, $\mathbf{U}$ is a subsequence of $\mathbf{V}$ if there is a fixed one-to-one
(injective) mapping function $k : I_U \to I_V$ which preserves order. Thus, each index
$i \in \{1, \ldots, U\}$ corresponds to exactly one index $k(i) \in \{1, \ldots, V\}$, where $k(i) < k(i+1)$.
It follows that $u_i = v_{k(i)}$. Furthermore, since $\mathbf{V}$ is a monotonically increasing sequence of
real numbers, $u_i \le u_{i+1} \; \forall \, i \in \{1, \ldots, U - 1\}$.

2.2  Multiple  Mode  Fault  Model 

In real-world systems, truly Byzantine behavior occurs only under highly improbable conditions.
Accordingly, Meyer and Pradhan [Mey87] have partitioned the space of all possible
faults into two distinct modes: Benign faults, defined as those which are self-incriminating,
or immediately self-evident to all non-faulty processes, and Malicious faults, defined as
all faults which do not qualify as benign. Thambidurai and Park [Tha88] have further
partitioned malicious faults into two sub-modes: Symmetric faults, whose behavior is perceived
identically by all non-faulty processes, and Asymmetric faults, whose behavior may
be perceived differently by different non-faulty processes.

The total number of faulty processes $t$ in a system is given by $t = a + s + b$, where $a$
is the number of asymmetric faults, $s$ is the number of symmetric faults, and $b$ is the
number of benign faults. It has been shown that if all faults are treated as asymmetric,
then convergence is possible only if $N \ge 3t + 1$ [Dol83]. This is the standard single-mode
Byzantine fault model. Similarly, if all faults are treated as symmetric, then simple
majority voting is sufficient, so that convergence can be guaranteed if $N \ge 2t + 1$. Finally,
if all faults are benign, then only one non-faulty process is required, i.e. $N \ge t + 1$.

The  results  to  be  presented  here  are  based  on  the  simultaneous  presence  of  all  three  fault 
modes. Using the mixed-mode fault model, $N$ is determined as a function of $a$, $s$, and
$b$, rather than $t$. This model is complete in that all possible fault modes are considered.
However,  it  is  not  unrealistically  conservative,  as  is  the  single-mode  Byzantine  fault  model. 


2.3  Convergence  in  Completely  Connected  Systems 

Single-step  convergence  is  formally  defined  in  terms  of  the  following. 

$\mathbf{V}_i$ = The multiset of values received in one round by arbitrary non-faulty process $i$.

$V$ = $|\mathbf{V}_i|$. If fewer than $V$ values are received, then an arbitrary default value is chosen
for each missing value so that $V$ is identical for all non-faulty processes.

$\mathbf{U}_i$ = The submultiset of correct values in $\mathbf{V}_i$, i.e. those values generated by non-faulty
processes.

$\mathbf{U}_{i \cap j}$ = $\mathbf{U}_i \cap \mathbf{U}_j$, the multiset of correct values received identically by two processes $i$ and
$j$. With complete connectivity, $\mathbf{U}_{i \cap j}$ is identical for all non-faulty processes. $\mathbf{U}_{i \cap j}$
may thus be taken as the multiset of correct values received by all non-faulty
processes.

Each non-faulty process $i$ executes a convergent voting algorithm, producing voted value
$F(\mathbf{V}_i)$. A voting algorithm is convergent if both of the following conditions are true for
every voting round:

[C1] For each non-faulty process $i$, the voted value is within the range of correct values,
i.e. $F(\mathbf{V}_i) \in \rho(\mathbf{U}_{i \cap j})$.

[C2] For each pair of non-faulty processes, $i$ and $j$, the difference between their voted
values is strictly less than the diameter of the multiset of correct values received,
i.e. $|F(\mathbf{V}_i) - F(\mathbf{V}_j)| \le C \, \delta(\mathbf{U}_{i \cap j})$, where $0 \le C < 1$.

Parameter  C  is  the  Convergence  Rate  of  a  voting  algorithm,  the  primary  measure  of  its 
effectiveness.  The  Robustness  of  a  voting  algorithm  is  the  minimum  number  of  processes 
N  required  to  tolerate  t  faults. 


2.4  MSR  Voting  Algorithms 

In completely connected systems, convergence properties have been determined [Kie91] for
an entire family of voting algorithms with the general form:

$$F(\mathbf{V}) = \mathrm{mean}\left[\mathrm{Sel}_\sigma\left(\mathrm{Red}^\tau(\mathbf{V})\right)\right].$$

Function $\mathrm{Red}^\tau$, called the Reduction function, removes the $\tau$ largest and $\tau$ smallest elements
from multiset $\mathbf{V}$. The function $\mathrm{Sel}_\sigma$, called the Selection function, selects a submultiset
of $\sigma$ elements from the reduced multiset $\mathrm{Red}^\tau(\mathbf{V})$. If $\mathrm{Sel}_\sigma$ produces a subsequence
of $\mathrm{Red}^\tau(\mathbf{V})$, then $F(\mathbf{V})$ is the Mean of a Subsequence of the Reduced multiset. Voting
algorithms with this property are called Mean-Subsequence-Reduced (MSR) algorithms
[Kie91].

Members of the MSR family differ from each other only in their definition of the selection
function $\mathrm{Sel}_\sigma$. Simple expressions have been found for the convergence rate and robustness
of any MSR algorithm in a completely connected system. These results show that it is
advantageous to discard recognized benign errors prior to voting provided that all processes
do so. Thus, given a total of $N$ processes containing $b$ benign faults, $V = N - b$.
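The general form "mean of a selection of the reduced multiset" can be sketched directly. The code below is a minimal illustration, not the implementation from [Kie91]. The function names are hypothetical, and the two selection functions (keep everything, keep only the two extremes) are assumed examples chosen in the spirit of the fault-tolerant mean and midpoint named earlier.

```python
from statistics import fmean
from typing import Callable, List, Sequence

# A selection function returns increasing 0-based indices into the reduced
# multiset, so its result is a subsequence by construction.
Selection = Callable[[Sequence[float]], List[int]]


def reduce_multiset(values: Sequence[float], tau: int) -> List[float]:
    """Sort the values and drop the tau smallest and tau largest."""
    ordered = sorted(values)
    if len(ordered) <= 2 * tau:
        raise ValueError("not enough values to remove 2 * tau extremes")
    return ordered[tau:len(ordered) - tau]


def select_all(medial: Sequence[float]) -> List[int]:
    """Keep every element of the reduced multiset."""
    return list(range(len(medial)))


def select_extremes(medial: Sequence[float]) -> List[int]:
    """Keep only the smallest and largest remaining elements."""
    return [0] if len(medial) == 1 else [0, len(medial) - 1]


def msr_vote(values: Sequence[float], tau: int, select: Selection) -> float:
    """Mean of the selected subsequence of the reduced multiset."""
    medial = reduce_multiset(values, tau)
    return fmean(medial[k] for k in select(medial))


received = [9.0, 1.0, 2.0, 3.0, 4.0, 100.0, 5.0]   # 100.0 plays an outlier
print(msr_vote(received, tau=1, select=select_all))
print(msr_vote(received, tau=1, select=select_extremes))
```

Only the selection function changes between the two calls, which reflects the statement that family members differ only in how they select from the reduced multiset.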


Robustness  —  For  a  completely  connected  synchronous  system  containing  a  asymmetric 
faults  and  s  symmetric  faults,  it  has  been  shown  that  an  MSR  voting  algorithm  can  be 
convergent only if [Kie91]:

$$\tau \ge a + s \tag{2.1}$$

$$\sigma \ge \begin{cases} 1 & : a = 0 \\ 2 & : a > 0 \end{cases} \tag{2.2}$$

$$V \ge 2\tau + \max(a + 1, \sigma). \tag{2.3}$$


Substituting the minimum allowable values for $\sigma$ and $\tau$ from (2.1) and (2.2) into (2.3)
shows that a convergent MSR voting algorithm may exist only if:

$$\tau \ge \tau_c = a + s \tag{2.4}$$

$$V \ge V_c = 3a + 2s + 1 \tag{2.5}$$

hence:

$$N \ge N_c = 3a + 2s + b + 1. \tag{2.6}$$
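The substitution can be checked numerically. The helper below (a hypothetical name, written only for illustration) plugs the minimum reduction and selection sizes into bound (2.3) and adds the discarded benign faults. For every combination tried, it agrees with the closed form 3a + 2s + b + 1.

```python
def min_processes(a: int, s: int, b: int) -> int:
    """Smallest N permitted by (2.1)-(2.3) for a asymmetric, s symmetric,
    and b benign faults, assuming benign values are discarded before voting."""
    tau_c = a + s                        # minimum reduction, from (2.1)
    sigma_c = 1 if a == 0 else 2         # minimum selection size, from (2.2)
    v_c = 2 * tau_c + max(a + 1, sigma_c)   # bound (2.3) at the minimum values
    return v_c + b                       # V = N - b, so N = V + b


# Compare against the closed form over a small grid of fault counts.
for a in range(4):
    for s in range(4):
        for b in range(4):
            assert min_processes(a, s, b) == 3 * a + 2 * s + b + 1

print(min_processes(a=1, s=1, b=1))   # one fault of each mode
print(min_processes(a=3, s=0, b=0))   # the same three faults, all asymmetric
```

With one fault of each mode the bound is 7 processes, whereas treating the same three faults as asymmetric gives 10, which illustrates why the mixed-mode model is less conservative.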

Convergence Rate -- An important result of [Kie91] is the ease with which the convergence
rate $C$ can be determined for any MSR voting algorithm. To begin, we define
the Medial Multiset $\mathbf{M} = \mathrm{Red}^\tau(\mathbf{V}) = \langle m_1, \ldots, m_M \rangle$, and the Selected Multiset
$\mathbf{S} = \mathrm{Sel}_\sigma(\mathbf{M}) = \langle s_1, \ldots, s_\sigma \rangle$. By the definition of an MSR algorithm, $\mathbf{S}$ is a subsequence
of $\mathbf{M}$. Thus, if $g$ is an index into $\mathbf{S}$, then for each integer $g \in \{1, \ldots, \sigma\}$ there exists
exactly one integer $k(g) \in \{1, \ldots, M\}$ which guarantees that $s_g = m_{k(g)}$.

Given two indices $g, h \in \{1, \ldots, \sigma\}$ where $g \le h$, we define $\Delta k(g,h)$ as the number of
elements in $\mathbf{M}$ spanned by elements $\langle s_g, \ldots, s_h \rangle$ in $\mathbf{S}$, i.e.

$$\Delta k(g,h) = [k(h) - k(g)]. \tag{2.7}$$

If $g = h$, then $\Delta k(g,h) = 0$. However, if $g < h$, then $\Delta k(g,h)$ is the number of elements
of $\mathbf{M}$ in the submultiset ...


For a given number of asymmetric faults, $a$, it will be useful to know the minimum value
of $(h - g)$ for which $\Delta k(g,h) \ge a$. Thus, for each $g \in \{1, \ldots, \sigma\}$, we define the quantity
$\Delta_g$ as follows:

IF: $\Delta k(g,\sigma) \ge a$,
THEN: $\Delta_g$ = the minimum value of $(h - g)$ such that $\Delta k(g,h) \ge a$,
ELSE: $\Delta_g$ does not exist for this value of $g$.

The ELSE clause is required because if $\Delta k(g,\sigma) < a$, then there is no $h \in \{g + 1, \ldots, \sigma\}$
for which $\Delta k(g,h) \ge a$. Thus, if $\Delta_g$ exists, then $\Delta k(g, g + \Delta_g) \ge a$.


Finally, we define the parameter $\gamma$ as:

$$\gamma = \max_{\forall \, g \in \{1, \ldots, \sigma\}} (\Delta_g). \tag{2.8}$$

Thus, given any $g \in \{1, \ldots, \sigma - \gamma\}$, it is assured that $\Delta k(g, g + \gamma) \ge a$. In other
words, the submultiset $\langle s_g, \ldots, s_{g+\gamma} \rangle$ is guaranteed to span $a$ elements of $\mathbf{M}$, for all
$g \in \{1, \ldots, \sigma - \gamma\}$. It has been shown [Kie91] that the convergence rate of an MSR
algorithm is:

$$C = \frac{\gamma}{\sigma} \tag{2.9}$$

As a practical matter, obtaining the values of $\gamma$ and $\sigma$ is relatively simple, given constants
$M$ and $a$, and a specified selection function $\mathrm{Sel}_\sigma$. Parameter $\sigma$ is just the number of
elements selected by $\mathrm{Sel}_\sigma(\mathbf{M})$. To determine $\gamma$, one simply determines $\Delta_g$ for each
$g \in \{1, \ldots, \sigma\}$ by inspection and selects the maximum thereof. If there is no value of $g$ for
which $\Delta_g$ exists, then $\gamma$ does not exist, and the algorithm is not convergent.
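The "by inspection" procedure can be mechanized. The sketch below takes the index map of a selection function as a list `k` (the position in the medial multiset of each selected element) and the number of asymmetric faults `a`. It then finds the smallest span for each starting index, takes the maximum, and divides by the number of selected elements. The function name and the list representation are assumptions made for illustration.

```python
from fractions import Fraction
from typing import Optional, Sequence


def convergence_rate(k: Sequence[int], a: int) -> Optional[Fraction]:
    """Return the maximum span divided by the number of selected elements,
    or None when no starting index has a valid span (not convergent).

    k[g] is the index in the medial multiset of the g-th selected element;
    the entries must be strictly increasing.
    """
    sigma = len(k)
    spans = []
    for g in range(sigma):
        # Smallest h >= g whose selected element lies at least `a`
        # medial positions beyond the g-th selected element.
        for h in range(g, sigma):
            if k[h] - k[g] >= a:
                spans.append(h - g)
                break
        # If the loop finds no such h, the span does not exist for this g.
    if not spans:
        return None
    return Fraction(max(spans), sigma)


# Medial multiset of 5 elements, two different selection functions.
print(convergence_rate(k=[1, 2, 3, 4, 5], a=2))   # every element selected
print(convergence_rate(k=[1, 5], a=2))            # only the two extremes
print(convergence_rate(k=[1, 5], a=5))            # span can never reach a
```

Using `Fraction` keeps the rate exact, so the results of different selection functions can be compared directly.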


3  Local  Convergence  with  Partial  Connectivity 

A  partially  connected  system  differs  from  a  completely  connected  system  in  that  a  given 
process  does  not  receive  values  from  all  non-faulty  processes.  Rather,  it  receives  values 
only  from  a  specific  subset  of  processes.  There  are  now  two  types  of  convergence  to  be 
considered:  local  convergence  over  a  specified  subgraph,  and  global  convergence  over  the 
entire  system  graph. 

This section presents theoretical bounds on the ability to achieve local convergence in
partially connected systems. The results of [Kie91] are extended to include the quantitative
impact of topological parameters such as degree. As before, simple expressions are obtained
for the convergence rate and robustness of synchronous MSR algorithms under the mixed-mode
fault model. For brevity, the results must be presented without proofs. The reader
is referred to [Kie91a] for detailed proofs.

The  following  constraints  are  placed  on  the  system: 

1.  The system topology is a non-hierarchical, regular, homogeneous, undirected graph
of N processing nodes, each with degree d.

2.  Messages received by a voting process may not be relayed to another process. Thus,
the physical and logical connectivity are identical. Each voting process receives its
own value as well as those of its immediate neighbors, so that $V = d + 1$ for all
non-faulty processes.

3.  $N \gg V$, so that "wrap-around" effects cannot assist the local convergence process
in a given voting round.

The  following  sets  and  multisets  describe  the  relationships  between  the  values  received  by 
two  nodes  i  and  j  in  a  partially  connected  system. 

$P_i$ = The set of processes whose values are receivable by process i ($P_j$ is similarly
defined for process j).

$P_{i \cap j}$ = $P_i \cap P_j$, the set of processes whose values are receivable by both process i and
process j.

$P_{i \cup j}$ = $P_i \cup P_j$, the set of processes whose values are receivable by process i, process j,
or both.

$U_{i \cap j}$ = The multiset of all values generated by non-faulty processes in $P_{i \cap j}$.

$U_{i \cup j}$ = The multiset of all values generated by non-faulty processes in $P_{i \cup j}$.

In a completely connected system, $U_{i \cap j} = U_{i \cup j}$. However, in a partially connected system,
$U_{i \cap j} \subset U_{i \cup j}$. We therefore define two types of local convergence for partially connected
systems.
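
As a small illustration of these definitions, the sets and multisets can be computed directly from an adjacency list. The sketch below assumes a hypothetical 8-node ring with made-up values and one faulty node; the helper names are not from the paper.

```python
from collections import Counter

def receivable(adj, node):
    """P_node: the node itself plus its immediate neighbours (no relaying)."""
    return {node} | set(adj[node])

def shared_sets(adj, i, j):
    """Return (P_i ∩ P_j, P_i ∪ P_j)."""
    p_i, p_j = receivable(adj, i), receivable(adj, j)
    return p_i & p_j, p_i | p_j

def nonfaulty_values(processes, values, faulty):
    """Multiset (Counter) of values held by the non-faulty members of `processes`."""
    return Counter(values[p] for p in processes if p not in faulty)

# Hypothetical 8-node ring (degree 2) with made-up values; node 3 is faulty.
n = 8
ring = {k: [(k - 1) % n, (k + 1) % n] for k in range(n)}
values = {k: 10.0 + k for k in range(n)}

p_int, p_uni = shared_sets(ring, 1, 2)
u_int = nonfaulty_values(p_int, values, faulty={3})
u_uni = nonfaulty_values(p_uni, values, faulty={3})
print(sorted(p_int), sorted(p_uni))
print(sorted(u_int.elements()), sorted(u_uni.elements()))
```

In this example the intersection multiset is contained in the union multiset, as the text states for partially connected systems.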

Intersection Convergence: Given a voting algorithm $F(\mathbf{V})$, two processes i and j are
Intersection Convergent if the following conditions are both true:

[I1]  $F(\mathbf{V}_i) \in \rho(U_{i \cap j})$, and $F(\mathbf{V}_j) \in \rho(U_{i \cap j})$,

[I2]  $|F(\mathbf{V}_i) - F(\mathbf{V}_j)| \le C\,\delta(U_{i \cap j})$, where $0 \le C < 1$.

Union Convergence: Given a voting algorithm $F(\mathbf{V})$, two processes i and j are Union
Convergent if the following conditions are both true:

[U1]  $F(\mathbf{V}_i) \in \rho(U_{i \cup j})$, and $F(\mathbf{V}_j) \in \rho(U_{i \cup j})$,

[U2]  $|F(\mathbf{V}_i) - F(\mathbf{V}_j)| \le C\,\delta(U_{i \cup j})$, where $0 \le C < 1$.
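
Both definitions share the same two-part test and differ only in the reference multiset. The following sketch makes that explicit; it assumes the usual reading of $\rho$ as the interval [min, max] and $\delta$ as max minus min, and all numeric values are made up for illustration.

```python
def rho(values):
    """Range of a multiset: the closed interval [min, max]."""
    return min(values), max(values)

def delta(values):
    """Diameter of a multiset: max - min."""
    lo, hi = rho(values)
    return hi - lo

def locally_convergent(f_i, f_j, u, c):
    """Check conditions 1 and 2 for voted results f_i, f_j against multiset u."""
    lo, hi = rho(u)
    in_range = lo <= f_i <= hi and lo <= f_j <= hi     # [I1] or [U1]
    contracts = abs(f_i - f_j) <= c * delta(u)         # [I2] or [U2]
    return in_range and contracts

u_int = [11.0, 12.0]               # made-up intersection multiset
u_uni = [10.0, 11.0, 12.0, 14.0]   # made-up union multiset
f_i, f_j = 11.0, 13.0              # made-up voted results
print(locally_convergent(f_i, f_j, u_int, 0.5))   # False: 13.0 lies outside [11, 12]
print(locally_convergent(f_i, f_j, u_uni, 0.5))   # True
```

The same pair of voted values fails the intersection test but passes the union test, because the union multiset has a wider range.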

A  major  difference  between  completely  connected  and  partially  connected  systems  is  the 
strategy  for  handling  benign  faults.  In  a  completely  connected  system,  benign  faults  can 
be  ignored  because  all  processes  can  delete  the  benign  errors  from  V  and  vote  with  a 
smaller sized multiset. However, in a partially connected system, no fault is self-evident
to all processes. Thus, ignoring self-evident faults would cause different processes to vote
using different sized multisets, so that V would not be identical for all processes. Thus,
only symmetric and asymmetric faults are considered in this analysis, leaving $t = a + s$.


3.1  Intersection  Convergence 

The  conditions  necessary  to  ensure  that  two  processes  i  and  j  are  Intersection  Convergent 
can  be  derived  as  a  variant  on  the  completely  connected  system  previously  described.  We 
begin with the following definitions:

a = The number of asymmetrically faulty processes in $P_{i \cap j}$.

s = The number of symmetrically faulty processes in $P_{i \cap j}$.

x = $|P_i| - |P_{i \cap j}| = |P_j| - |P_{i \cap j}|$, the number of processes whose values are receivable by
either i or j, but not by both.

Each process $x_i \in \{P_i \setminus P_{i \cap j}\}$ communicates with process i, but not with process j.
Similarly, each process $x_j \in \{P_j \setminus P_{i \cap j}\}$ communicates with process j, but not with
process i. In the worst case, two processes $x_i$ and $x_j$ can send different values to processes
i and j, respectively. Since $x_i$ and $x_j$ are not members of $P_{i \cap j}$, their values could be outside
of $\rho(U_{i \cap j})$. Thus each process pair $(x_i, x_j)$ can have the same impact on $\mathbf{V}_i$ and $\mathbf{V}_j$ as a
single asymmetrically faulty process in $P_{i \cap j}$. This effect can occur regardless of the fault
status of $x_i$ and $x_j$. There are x such process pairs, which can behave like x asymmetric
faults. We thus define $\Delta'_g$ as the variant on $\Delta_g$ obtained by substituting $(a + x)$ for $a$.

IF:  $\Delta_k(g,\sigma) \ge (a + x)$,
THEN:  $\Delta'_g$ = the minimum value of $(h - g)$ such that $\Delta_k(g,h) \ge (a + x)$,
ELSE:  $\Delta'_g$ does not exist for this value of $g$.

Parameter $\gamma'$ is then the corresponding variant on $\gamma$, i.e.:

$$\gamma' = \max_{\forall\, g \in \{1, \ldots, \sigma\}} (\Delta'_g).$$

Thus, given any $g \in \{1, \ldots, \sigma - \gamma'\}$, it is assured that $\Delta_k(g, g + \gamma') \ge (a + x)$.

It can be shown that an MSR algorithm can be Intersection Convergent only if [Kie91a]:

$$\tau \ge [a + x] + s \qquad (3.1)$$

$$\sigma \begin{cases} \ge 1 : & [a + x] = 0 \\ \ge 2 : & [a + x] > 0 \end{cases} \qquad (3.2)$$

$$V \ge 2\tau + \max\big( [a + x] + 1,\ \sigma \big). \qquad (3.3)$$

Furthermore, the convergence rate of the algorithm is:

$$C = \frac{\gamma'}{\sigma} \qquad (3.4)$$


Using the minimum allowable values for $\sigma$ and $\tau$ from (3.1) and (3.2), an Intersection
Convergent MSR voting algorithm may exist only if:

$$\tau \ge \tau_I = (a + s) + x \qquad (3.5)$$

$$V \ge V_I = 3a + 2s + 1 + (3x). \qquad (3.6)$$

The minimum size of $P_{i \cap j}$ can be derived from (3.6) by noting that $|P_{i \cap j}| = V - x$. Thus,
Intersection Convergence between processes i and j is possible only if:

$$|P_{i \cap j}| \ge (3a + 2s + 1) + (2x). \qquad (3.7)$$
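
For readers who want the intermediate algebra, the step from (3.1) to (3.3) to the bounds (3.5) to (3.7) can be written out as follows. This is a worked restatement, not part of the original proofs; it uses the minimum $\sigma$ allowed by (3.2), for which the max term in (3.3) always equals $[a + x] + 1$.

```latex
\begin{align*}
\tau_I &= [a + x] + s = (a + s) + x, \\
V_I &= 2\tau_I + \max\big([a + x] + 1,\ \sigma\big) \\
    &= 2(a + s + x) + (a + x + 1)
       \qquad \text{since } \sigma_{\min} \le [a + x] + 1 \\
    &= 3a + 2s + 1 + 3x, \\
|P_{i \cap j}| &= V - x \ \ge\ V_I - x = (3a + 2s + 1) + 2x.
\end{align*}
```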

3.2  Union  Convergence 

Union Convergence requires convergence within $\rho(U_{i \cup j})$ rather than $\rho(U_{i \cap j})$. Accordingly,
it can be shown that Union Convergence is possible under less restrictive conditions than
Intersection Convergence.

Two processes $x_i \in \{P_i \setminus P_{i \cap j}\}$ and $x_j \in \{P_j \setminus P_{i \cap j}\}$ can still send two different values to
processes i and j, respectively. However, if $x_i$ and $x_j$ are both non-faulty, then both values
are within $\rho(U_{i \cup j})$. Thus non-faulty $(x_i, x_j)$ pairs have less impact on Union Convergence
than on Intersection Convergence.

We retain the previous definitions that a and s are the number of asymmetric and symmetric
faults, respectively, in $P_{i \cap j}$. We then define:

f = The maximum number of faults in either $\{P_i \setminus P_{i \cap j}\}$ or $\{P_j \setminus P_{i \cap j}\}$, regardless of
the fault modes exhibited.

It can be shown that an MSR algorithm may be Union Convergent only if [Kie91a]:

$$\tau \ge a + s + f \qquad (3.8)$$

$$\sigma \begin{cases} \ge 1 : & [a + x] = 0 \\ \ge 2 : & [a + x] > 0 \end{cases} \qquad (3.9)$$

$$V \ge 2\tau + \max\big( [a + x] + 1,\ \sigma \big). \qquad (3.10)$$

Applying the minimum values for $\sigma$ and $\tau$ yields the bounds:

$$\tau \ge \tau_U = (a + s) + f \qquad (3.11)$$

$$V \ge V_U = (3a + 2s + 1) + (x + 2f) \qquad (3.12)$$

$$|P_{i \cap j}| \ge V_U - x = (3a + 2s + 1) + (2f). \qquad (3.13)$$

If an MSR algorithm is Union Convergent, then the convergence rate is identical to that
for Intersection Convergence, i.e. $C = \gamma'/\sigma$.
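
The two sets of bounds are easy to compare numerically. The sketch below simply evaluates (3.5) to (3.7) and (3.11) to (3.13); the function names and the example scenario are illustrative choices, not taken from the paper.

```python
def intersection_bounds(a, s, x):
    """Minimum tau, V and |P_i∩j| from (3.5)-(3.7)."""
    tau = (a + s) + x
    v = (3 * a + 2 * s + 1) + 3 * x
    return tau, v, v - x

def union_bounds(a, s, x, f):
    """Minimum tau, V and |P_i∩j| from (3.11)-(3.13)."""
    tau = (a + s) + f
    v = (3 * a + 2 * s + 1) + (x + 2 * f)
    return tau, v, v - x

# Arbitrary scenario: one asymmetric fault in the shared set, x = 3 non-shared
# processes per node, and no faults outside the shared set (f = 0).
print(intersection_bounds(a=1, s=0, x=3))   # (4, 13, 10)
print(union_bounds(a=1, s=0, x=3, f=0))     # (1, 7, 4)
```

Because f counts faults among the x non-shared processes of one node, f cannot exceed x, so the Union bounds never exceed the Intersection bounds.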


3.3  Summary 

Table 1 summarizes the relevant parameters for convergence in completely connected systems
and for Intersection Convergence or Union Convergence in partially connected systems.
The listed bounds on $\tau$ and V are minima, beneath which convergence cannot
be guaranteed. These bounds are tight, because there exists an MSR algorithm which
is convergent with the listed minimal parameters. That algorithm is the Fault-Tolerant
Midpoint [Dol83], in which $Sel_\sigma(M)$ selects the two extrema of M,
so that $\sigma = 2$, the minimum value allowed by (2.2), (3.2), and (3.9) for $a \ge 1$.


4  Network  Examples 

The  results  summarized  in  Table  1  show  the  general  bounds  on  convergence  in  regular 
homogeneous  network  graphs.  Applying  these  results  to  specific  error  scenarios  in  selected 
interconnection  topologies  illustrates  the  relative  robustness  of  these  networks  for  both 
types  of  local  convergence. 


4.1  Mesh  Networks 

Three common mesh networks are shown in Figure 1, with degrees d = 4, d = 6, and d = 8,
respectively. Since each node also receives its own values, these degrees yield V = 5, V = 7,
and V = 9, respectively. For each network, two nodes are selected and labelled i and j such
that $|P_{i \cap j}|$ is maximized. In Figure 1, the nodes enclosed within a dashed box comprise
$P_{i \cap j}$ for that mesh.

Completely Connected              Partially Connected:                        Partially Connected:
                                  Intersection                                Union
--------------------------------  ------------------------------------------  ------------------------------------------
$\tau_C = (a + s)$                $\tau_I = (a + s) + x$                      $\tau_U = (a + s) + f$
$V_C = (3a + 2s + 1)$             $V_I = (3a + 2s + 1) + (3x)$                $V_U = (3a + 2s + 1) + (x + 2f)$
$|P_{i \cap j}| = N = V + b$      $|P_{i \cap j}| \ge (3a + 2s + 1) + (2x)$   $|P_{i \cap j}| \ge (3a + 2s + 1) + (2f)$
$\Delta_k(g, g + \gamma) \ge a$   $\Delta_k(g, g + \gamma') \ge a + x$        $\Delta_k(g, g + \gamma') \ge a + x$
$C = \gamma/\sigma$               $C = \gamma'/\sigma$                        $C = \gamma'/\sigma$

Table 1: Summary of Convergence Parameters

Inspection of Figure 1 reveals that for the chosen (i, j) pair, the number of non-shared
values received by each node is x = 3 in all three meshes. In the best case of a fault-free
system, Table 1 shows that Intersection Convergence is possible only if
$V \ge V_I = 3x + 1 = (3 \times 3) + 1 = 10$. Since $V_I > V$ for all three meshes, there exists no MSR
voting algorithm which is Intersection Convergent for any of these systems.

In a fault-free system, $V_U = x + 1 = 3 + 1 = 4$. Thus, all three meshes are Union Convergent
in the fault-free case. Assuming a single-fault scenario, the worst case would be if that
fault were asymmetric, in which case $V_U = 3a + x + 1 = 3 + 3 + 1 = 7$. Thus, in any
single-fault scenario, both the Hexagonal and Octagonal meshes are Union Convergent. It
can also be shown that the Octagonal mesh can tolerate a double fault as long as $a \le 1$,
while none of these networks can tolerate a double asymmetric fault or any triple fault.

Figure 1: Common Mesh Networks (panel a: Quadratic Mesh)
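
The mesh figures quoted above can be reproduced with a short script. This is a sketch under assumptions: the neighbour offsets are one plausible grid encoding of the three meshes in Figure 1 (the figure itself is not available here), the grid is unbounded so that no wrap-around occurs, node j is searched only among the neighbours of i, and all faults are placed in the shared set (f = 0).

```python
# Neighbour offsets on an unbounded square grid; an assumed encoding of Figure 1.
MESHES = {
    "quadratic": [(1, 0), (-1, 0), (0, 1), (0, -1)],
    "hexagonal": [(1, 0), (-1, 0), (0, 1), (0, -1), (1, 1), (-1, -1)],
    "octagonal": [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                  if (dx, dy) != (0, 0)],
}

def receivable(node, offsets):
    """P_node: the node itself plus its immediate neighbours."""
    x0, y0 = node
    return {node} | {(x0 + dx, y0 + dy) for dx, dy in offsets}

def min_non_shared(offsets):
    """x for the neighbour j of the origin i that maximises |P_i∩j|."""
    p_i = receivable((0, 0), offsets)
    return min(len(p_i - receivable(j, offsets)) for j in offsets)

def v_union(a, s, x, f=0):
    """V_U from Table 1."""
    return (3 * a + 2 * s + 1) + (x + 2 * f)

def analyse(offsets, max_faults=3):
    """Return (V, x, fault-free V_I, list of tolerated (a, s) fault mixes)."""
    v = len(offsets) + 1
    x = min_non_shared(offsets)
    tolerated = [(a, s)
                 for a in range(max_faults + 1)
                 for s in range(max_faults + 1 - a)
                 if a + s > 0 and v >= v_union(a, s, x)]
    return v, x, 3 * x + 1, tolerated

for name, offsets in MESHES.items():
    print(name, analyse(offsets))
```

Under these assumptions the script gives x = 3 and a fault-free $V_I$ of 10 for all three meshes, and it lists no tolerated fault mix for the Quadratic mesh, single faults only for the Hexagonal mesh, and double faults with at most one asymmetric fault for the Octagonal mesh.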


4.2  Hypercubes 

In a hypercube, the degree $d = \log_2(N)$, so that $V = \log_2(N) + 1$. Each node is connected
to all nodes whose binary address is at a Hamming distance of unity from its own address.
Thus, for any two nodes i and j, $|P_{i \cap j}| \le 2$. By definition,
$x = V - |P_{i \cap j}| \ge V - 2 = \log_2(N) - 1$.

From Table 1, the fault-free condition for Intersection Convergence is
$V_I = 3x + 1 = 3[\log_2(N) - 1] + 1$. Therefore, an Intersection Convergent MSR algorithm can
exist for a hypercube only if:

$$V \ge V_I,$$
$$V \ge 3x + 1,$$
$$\log_2(N) + 1 \ge 3[\log_2(N) - 1] + 1,$$
$$\log_2(N) \le \tfrac{3}{2}.$$

Since $\log_2(N)$ must be an integer, the largest hypercube for which an Intersection Convergent
MSR algorithm could exist is defined by $\log_2(N) = 1$, or N = 2. This is the trivial
system comprised of two nodes connected by a single link.

Performing a similar analysis for Union Convergence yields:

$$V \ge V_U,$$
$$V \ge x + 1,$$
$$\log_2(N) + 1 \ge (\log_2(N) - 1) + 1,$$
$$\log_2(N) + 1 \ge \log_2(N).$$

Thus, any fault-free hypercube will be Union Convergent. However, any single fault within
$P_{i \cup j}$ will add at least 2 to the right-hand side of this expression, making nodes i and j
non-convergent.
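
The hypercube argument can be checked by enumeration for small dimensions. In this sketch nodes are integer addresses, neighbours differ in exactly one bit, and nodes 0 and 1 are used as the adjacent (i, j) pair; the function names are illustrative, and the "+ 2" term is the minimum single-fault increase stated in the text.

```python
def receivable(node, dim):
    """A node plus every node at Hamming distance 1 in a dim-dimensional hypercube."""
    return {node} | {node ^ (1 << bit) for bit in range(dim)}

def analyse(dim):
    """Return (V, x, intersection_ok, union_ok, union_ok_with_one_fault)."""
    p_i, p_j = receivable(0, dim), receivable(1, dim)   # nodes 0 and 1 are adjacent
    v = len(p_i)
    x = v - len(p_i & p_j)
    return v, x, v >= 3 * x + 1, v >= x + 1, v >= x + 1 + 2

for dim in range(1, 6):
    print(dim, analyse(dim))
```

For dimensions 1 to 5 the enumeration agrees with the derivation: only the one-dimensional cube passes the fault-free Intersection test, every cube passes the fault-free Union test, and none passes once a single fault is added.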


5  Continuing  Research 

Table  1  shows  the  local  convergence  properties  of  MSR  algorithms  in  regular  homogeneous 
networks.  These  results  and  the  methods  used  to  obtain  them  are  serving  as  the  basis  for 
further  research  on  a  number  of  related  problems. 


5.1  Global  Convergence 

For  most  system  applications,  the  goal  of  convergent  voting  is  to  achieve  Approximate 
Agreement  on  a  global  level,  i.e.  to  reduce  the  range  of  values  held  by  all  non-faulty 
processes.  While  local  convergence  is  a  necessary  pre-condition  to  global  convergence,  it 
is  not  by  itself  sufficient  to  guarantee  global  convergence.  The  existence  and  rate  of  global 
convergence  also  depend  on  the  topology  of  the  system,  the  distribution  of  initial  values, 
and  the  distribution  of  faults  throughout  the  system. 

Single-step  local  convergence  does  not  necessarily  produce  single-step  global  convergence. 
At  the  global  level  convergence  may  be  asymptotic  rather  than  immediate  [Fek90].  There 
may  be  a  delay  of  several  rounds  before  the  global  diameter  of  correct  values  begins  to 
decrease.  Thus,  a  single-step  convergence  rate  is  an  inappropriate  performance  metric  for 
global  convergence.  Two  metrics  of  interest  are  the  maximum  number  of  rounds  required 
before  the  first  reduction  in  global  diameter,  and  the  long-term  average  (or  asymptotic) 
convergence  rate.  A  sub-family  of  MSR  algorithms  has  been  identified  which  appears  to 
minimize  both  of  these  metrics.  Current  efforts  are  directed  at  quantifying  the  performance 
of  the  global  convergence  process.  Recent  results  suggest  that  global  convergence  can  be 
guaranteed  only  if: 


1.  The fault-free system can achieve local Union Convergence between nearest neighbor
nodes.

2.  The number of errors received by any non-faulty node does not exceed the fault-
tolerance limits set in Table 1 for Union Convergence with its nearest neighbors.


Examples  have  shown  that  if  a  single  non-faulty  node  fails  to  meet  condition  2  above,  then 
the entire system will become non-convergent. These examples indicate that global convergence
is extremely sensitive to the clustering of faults. General expressions describing
the  precise  conditions  required  for  global  convergence  to  occur  have  not  yet  been  derived. 


5.2  Asynchronous  Systems 

In  asynchronous  systems,  there  are  no  bounds  on  the  processing  time  or  message  delivery 
delay  for  non-faulty  processes  [Dol83].  Nonetheless,  convergence  is  still  possible.  Using 
the single-mode Byzantine fault model in completely connected systems, it has been shown
that convergence is possible if $N \ge 5t + 1$. Current work indicates that for completely
connected systems, the mixed-mode model yields $N \ge 5a + 4s + 1$. The methods used
herein  for  synchronous  systems  can  be  extended  to  include  mixed-mode  partially  connected 
asynchronous  systems  as  well. 

5.3  Non-Homogeneous  Topologies 

The results in Table 1 are based on the assumption that the network is regular and homogeneous.
Thus, the degree and connectivity of all nodes are identical. Specifically, for
any two voting processes i and j, the size of the voting multiset V and the number of
non-shared processes x are identical.

If the network is not homogeneous, then one must deal with boundary conditions at the
edges of the network. Worse still is an irregular network graph, in which case nodes i
and j may have different-sized voting multisets. Thus, the interactions between voting
algorithms  in  regions  of  differing  connectivity  are  being  investigated. 


5.4  Omission  Faults 

In many systems, the most likely fault is an asymmetric omission fault caused by a faulty
communication link. The known Approximate Agreement algorithms cannot exploit this
restriction on fault behavior. We are currently studying a variant of the MSR family of
algorithms called Omission-MSR (OMSR). Previously, one OMSR algorithm has been
applied to clock synchronization in a system where asymmetric omission faults were
known to be dominant [ThaSS]. In that particular case the selected OMSR algorithm was
more robust than would be possible for any MSR algorithm. Using methods similar to
those employed herein, general expressions for the robustness and convergence rates of
OMSR algorithms are being derived.


5.5  Partial  Message  Relay 

At  this  point,  two  extreme  approaches  are  known  for  achieving  approximate  agreement. 
The conventional approach employs complete message relay to emulate a completely connected
network, while the approach used here employs no message relays. The complete
relay approach offers faster convergence, while the no relay approach imposes lower message
passing overhead. There is a continuum of partial relay policies lying between these
two  extremes.  For  example,  nodes  in  a  system  may  relay  only  those  messages  originated  by 
an  immediate  neighbor.  This  approach  can  yield  better  robustness  and  convergence  rates 
than  the  no  relay  approach,  while  still  imposing  much  lower  overhead  than  the  complete 
relay  approach. 


6  Conclusions 

This paper has presented a basis for limiting the overhead of achieving Approximate Agreement
in large sparsely connected networks. These results make Approximate Agreement
feasible in large Fault-Tolerant Real-Time systems. They thus facilitate the confident design
and verification of distributed processes such as clock synchronization and redundant
sensor management.

The  general  approach  was  to  prohibit  the  relay  of  convergent  voting  messages  so  that  each 
processor performs convergent voting only with its immediate neighbors. The main accomplishments
were: (1) presentation of low-overhead convergent voting algorithms which
function without message relay, (2) analysis of the convergence rates of these algorithms
using the mixed-mode fault model, and (3) determination of the theoretical bounds on fault-
tolerance as a function of the topology and connectivity of the network.

The  bounds  for  Intersection  and  Union  Convergence  shown  in  Table  1  apply  to  the  entire 
MSR  family  of  algorithms.  Moreover,  the  required  algorithmic  and  topological  parameters 
are easy to determine, given a particular MSR algorithm and a particular topology. These
results provide the foundation for on-going research on a number of related problems.

Abstract:  The distributed recovery block (DRB)
scheme initially formulated by the author in 1983
is a scheme devised for achieving task-level real-time
fault tolerance. Under the scheme a fault
arising  in  execution  of  a  real-time  task  does  not 
result  in  the  failure  of  the  timely  delivery  of  an 
expected  result  to  the  receiver  tasks  since  a 
redundant  task  is  acting  in  a  different  node  in  a 
timely  fashion.  The  DRB  scheme  can  be  used  to 
obtain  highly  reliable  real-time  computing 
stations  each  capable  of  forward  recovery  from 
both  hardware  and  software  faults.  Several 
demonstrations  of  the  performance  of  the  scheme 
in  practical  application  contexts  such  as  nuclear 
reactor  control  and  defense  command-control 
applications  were  conducted.  In  recent  years 
efforts  have  been  made  to  expand  the  application 
fields  of  the  DRB  scheme  by  extending  the 
operational  principles  of  and  the  implementation 
techniques  for  the  scheme  in  several  directions. 
The formulated extensions have been discussed
piecemeal  in  different  places.  This  paper  is  an 
attempt  to  take  a  comprehensive  assessment  of 
the  extension  efforts  made  and  to  present  some 
newly  formulated  extensions  and  desirable 
directions  for  future  extensions. 

Area:  Real-Time  Fault-Tolerant  Systems 
Design 


1.  Introduction 

Many  challenging  real-time  applications  such 
as  those  encountered  in  defense  and  space 
exploration  areas  deal  with  "dirty"  environments 
where  electrical,  electromagnetic,  and  mechanical 
disturbances  cause  relatively  frequent  failures  of 
computer  components.  Therefore,  real-time  fault 
tolerance  is  a  major  requirement  imposed  on  the 
computer systems used in such applications,
unlike in commercial applications dealing with
environments with low fault rates. Dramatic
changes that occurred in the past decade in the
relative costs of hardware to software and those
of VLSI processors to other peripheral and storage
components have eliminated many old problems that
had long plagued the designers of complex real-time
computer systems, while bringing in new
research issues. To mention a few examples:

(1)  Little incentive for multiprogramming and
more for distributed processing:

Because  of  the  dramatically  reduced 
hardware  costs,  complex  software  schemes  such 
as  multiprogramming  techniques  for  heavy 
utilization of hardware are rapidly losing their
appeal in many safety-critical real-time
applications. In fact, it is now often much more
cost-effective to design one-task-per-processor
systems. Such approaches make the temporal
behavior analysis easy and encourage high-level
optimizations such as those aimed at faster
guaranteed response. A node (or nodes) of a
distributed  computer  system  (DCS)  dedicated  to 
execution  of  an  atomic  real-time  task  is  called  a 
computing  station  in  this  paper. 

(2)  Ease  of  using  hardware  redundancy: 

Again due to the relatively low cost of
hardware, it is now easier than before to use
hardware redundancy for hardware fault
tolerance, in particular by use of active redundant
hardware components. Techniques for hardware
fault tolerance with the forward recovery effect
such as the TMR (triple modular redundancy)
scheme and the pair-of-comparing-pairs scheme
have been extensively studied [And81, Car85,
Toy87, Wil85].

(3)  Persistence  of  software  reliability  problems: 

Achieving an ultra-high reliability of real-time
distributed software is still a serious
challenge.

These technological and economic changes
have  also  dictated  concomitant  changes  in  the 
most  cost-effective  unit  of  redundancy  which  also 
defines  boundaries  for  fault  containment  and 
replacement  for  repair.  Consequently,  the  focus 
in  development  of  the  fault-tolerant  system  design 
technology  has  been  shifting  from  the  techniques 
for  utilizing  circuit-level  or  function-unit-level 
redundancy  through  those  for  utilizing 
processor-level  redundancy  to  those  for 
computing-station-level  redundancy. 

The  distributed  recovery  block  (DRB) 
scheme  initially  formulated  by  the  author  in 
[Kim84]  is  a  scheme  for  exploiting  redundancy  at 
the  computing  station  level  to  achieve  task-level 
fault  tolerance.  Under  the  scheme  a  fault  arising 
in  execution  of  a  real-time  task  does  not  result  in 
the  failure  of  the  timely  delivery  of  an  expected 
result  to  the  receiver  tasks  since  a  redundant  task 
is  acting  in  a  timely  fashion.  The  DRB  scheme 
can  be  used  to  obtain  highly  reliable  computing 
stations  each  capable  of  forward  recovery  from 
both  hardware  and  software  faults.  Several 
demonstrations  of  the  performance  of  the  scheme 
in  practical  application  contexts  were  conducted 
[Kim88, Kim89a, Hec89, Hec91, Kim91a].

In  recent  years  efforts  have  been  made  to 
expand  the  application  fields  of  the  DRB  scheme 
by extending the operational principles and the
implementation  techniques  for  the  scheme  in 
several  directions.  The  formulated  extensions 
have  been  discussed  piecemeal  in  different  places. 
This  paper  is  an  attempt  to  take  a  comprehensive 
assessment  of  the  extension  efforts  made  and  to 
present  some  newly  formulated  extensions  and 
desirable  directions  for  future  extensions. 

The  paper  starts  with  an  overview  of  the  basic 
DRB  scheme  and  its  application  fields  in  Section 
2.  In  Section  3,  five  major  extensions  of  the  DRB 
scheme  are  discussed  and  remaining  issues 
related  to  full  development  of  the  extensions  are 
discussed. Thereafter, a simplified application of
the DRB scheme to highly parallel multi-computer
networks (HPM's) in order to realize
fault-tolerant execution of a large number of real-time
tasks is discussed. The final section
concludes  with  a  discussion  on  four  desirable 
directions  for  future  extension  of  the  DRB 
technology. 


2.  Basic  DRB  Scheme 

2.1  Basic  principles 

The  distributed  recovery  block  (DRB) 
scheme  is  based  on  a  combination  of  both 
distributed  concurrent  processing  and  recovery 
block  structuring  concepts  to  achieve  fast  forward 
error  recovery  and  to  treat  both  hardware  and 
software  faults  in  a  uniform  manner  with  minimal 
execution  overhead  [Kim84,  Kim89a].  It  is  an 
active  redundancy  scheme  where  multiple 
processors  concurrently  execute  multiple  versions 
of  a  software  component  and  then  the  same 
reasonableness  check  routine.  The 
reasonableness  check  routine  in  each  processor, 
together  with  a  watch-dog  timer,  checks 
reasonableness  of  the  computational  results  of  the 
version  executed  as  well  as  the  timeliness  of  the 
execution. 

A recovery block consists of one or more
routines, called try blocks here, designed to
compute the same or similar results, and an
acceptance test, which is an expression of the
criterion used for accepting the results of try
blocks [Hor74, Ran75]. A try (i.e., execution of a
try block) is thus always followed by an
acceptance test execution. If an error is detected
during a try or as a result of an acceptance test
execution, then a rollback-and-retry with another
try block follows. A recovery block is therefore an
enclosure of some recoverable activities of a
process, and it facilitates backward recovery and
software fault tolerance.
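
The control structure of a recovery block can be illustrated with a minimal Python sketch. It is not the author's implementation: rollback is modelled simply by letting each try work on a copy of the saved state, and the two try blocks and the acceptance test are made-up examples.

```python
import copy

def recovery_block(state, try_blocks, acceptance_test):
    """Run the try blocks in order until one result passes the acceptance test."""
    for try_block in try_blocks:
        working = copy.deepcopy(state)   # recovery point: the try works on a copy
        try:
            result = try_block(working)
        except Exception:
            continue                     # error detected during the try
        if acceptance_test(result):
            return result
        # Test failed: `working` is discarded (rollback) and the next try block runs.
    raise RuntimeError("every try block failed")

def primary(s):
    """Hypothetical primary try block with a deliberate defect for large inputs."""
    return s["x"] ** 0.5 if s["x"] < 100 else -1.0

def alternate(s):
    """Hypothetical alternate try block: square root by Newton iteration."""
    r = max(s["x"], 1.0)
    for _ in range(50):
        r = 0.5 * (r + s["x"] / r)
    return r

accept = lambda r: r >= 0                # made-up acceptance criterion
print(recovery_block({"x": 9.0}, [primary, alternate], accept))     # primary accepted
print(recovery_block({"x": 144.0}, [primary, alternate], accept))   # alternate used
```

In the second call the primary result fails the acceptance test, so the alternate runs from the unchanged initial state.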

Under  the  DRB  scheme,  a  recovery  block  is 
replicated  into  multiple  nodes  forming  a  DRB 
computing  station  for  parallel  processing.  In 
most  cases  a  recovery  block  containing  just  two 
try  blocks,  i.e.,  the  primary  and  the  alternate,  is 
designed  and  then  assigned  to  two  different  nodes 
called  the  primary  and  shadow  nodes  as  depicted 
in  Figure  1 .  The  roles  of  the  two  try  blocks  are 
assigned  differently  in  the  two  nodes.  Primary 
node  X  uses  try  block  A  as  the  first  try  block 
initially,  whereas  shadow  node  Y  uses  try  block  B 
as  the  initial  first  try  block.  Therefore,  until  a 
fault  is  detected,  both  nodes  receive  the  same 
input  data,  process  the  data  by  use  of  two 
different  try  blocks  (i.e.,  block  A  on  node  X  and 
block  B  on  node  Y),  and  check  the  results  by  use 
of the acceptance test. Both nodes perform all
these tasks concurrently. The time acceptance
test (i.e., the time-out mechanism) is used to
ensure timely behavior of both nodes.

In  a  fault-free  situation,  both  nodes  will  pass 
the  acceptance  test  with  the  results  computed  with 
their  first  try  blocks.  In  such  a  case,  the  primary 
node  notifies  the  shadow  of  its  success  in  the 
acceptance  test.  Thereafter,  only  the  primary 
node  sends  its  output  to  the  successor  computing 
station(s).  Both  nodes  then  proceed  to  the  next 
task  cycle.  However,  if  the  primary  node  fails 
and  the  shadow  node  passes  its  own  acceptance 
test,  the  shadow  node  assumes  the  role  of  the 
primary,  i.e.,  the  nodes  exchange  their  roles.  To 
be  more  specific,  upon  its  failure  in  passing  the 
acceptance  test  the  primary  node  attempts  to 
inform  the  shadow  node.  The  shadow  node  will 
take  over  the  role  of  the  primary  as  soon  as  it 
receives  the  notice.  If  the  primary  node  crashes 
completely,  the  shadow  node  will  recognize  the 
failure  of  the  primary  upon  expiration  of  the 
preset time limit. It will then become the new
primary.  These  interactions  between  two  nodes 
are  done  asynchronously.  On  the  other  hand,  if 
the  shadow  node  fails  first,  the  primary  node  need 
not  be  disturbed.  In  both  cases,  the  failed  node 
attempts  to  become  an  operational  shadow  node; 
it  attempts  to  roll  back  and  retry  with  its  second 
try  block  to  bring  its  application  computation 
state  including  local  database  up-to-date.  This 
attempt  does  not  disturb  the  primary  node. 
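
The role exchange described above can be sketched as a single task cycle. This is a simplified, sequential illustration rather than the DRB implementation: the two nodes run one after the other instead of concurrently, a crash or time-out is modelled as an exception, and the class `Node`, the function `task_cycle`, and the try blocks are hypothetical names. The paragraph does not say what happens when both first tries fail, so the sketch simply returns no output in that case.

```python
class Node:
    def __init__(self, name, try_blocks, role):
        self.name, self.try_blocks, self.role = name, list(try_blocks), role

    def first_try(self, data, acceptance_test):
        """Run the first try block; None models a crash, a time-out, or a failed test."""
        try:
            result = self.try_blocks[0](data)
        except Exception:
            return None
        return result if acceptance_test(result) else None

    def retry(self, data, acceptance_test):
        """Roll back and retry with the second try block to stay usable as a shadow."""
        try:
            return acceptance_test(self.try_blocks[1](data))
        except Exception:
            return False

def task_cycle(primary, shadow, data, acceptance_test):
    """One task cycle; returns (output, new_primary, new_shadow)."""
    p_result = primary.first_try(data, acceptance_test)
    s_result = shadow.first_try(data, acceptance_test)
    if p_result is not None:
        if s_result is None:
            shadow.retry(data, acceptance_test)    # the primary is not disturbed
        return p_result, primary, shadow           # only the primary sends output
    if s_result is not None:
        primary.retry(data, acceptance_test)       # failed node tries to recover
        primary.role, shadow.role = "shadow", "primary"
        return s_result, shadow, primary           # roles exchanged
    return None, primary, shadow                   # assumed handling: no output

def block_a(v):
    """Hypothetical try block A with a defect for negative input."""
    if v < 0:
        raise ValueError("block A cannot handle negative input")
    return v * 2

def block_b(v):
    """Hypothetical try block B computing the same result differently."""
    return v + v

ok = lambda r: isinstance(r, int)                  # made-up acceptance test
x = Node("X", [block_a, block_b], "primary")       # X tries A first
y = Node("Y", [block_b, block_a], "shadow")        # Y tries B first
out, new_primary, new_shadow = task_cycle(x, y, -3, ok)
print(out, new_primary.name, new_shadow.name)
```

Here node X fails on its first try block, node Y passes, and Y becomes the new primary while X recovers using block B.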

This approach has two useful characteristics:

a)  Recovery can be accomplished in the same
manner regardless of whether a node fails due to
hardware faults or software faults;

b)  The  recovery  time  is  minimal  since  maximum 
concurrency  is  exploited  between  the  primary  and 
the  shadow  nodes. 

In  recent  years  basic  techniques  for 
implementation  of  the  DRB  scheme  were 
established  and  several  demonstrations  of  the 
performance  of  the  scheme  in  practical 
application  contexts  were  conducted.  For 
example,  several  experiments  were  conducted  at 
the  author's  location  and  they  involved 
application  of  the  DRB  scheme  to  adjacent 
computing stations in real-time parallel
processing multi-computer testbeds [Kim88,
Kim89a, Kim91]. Figure 2 depicts one such
testbed. The hardware base of the testbed
contains several global shared memory modules
which facilitate inter-node communication. Each
data queue serving as a link among logically
adjacent nodes is housed in one of the global
shared memory modules. The distributed real-time
application software in this testbed was
designed  to  perform  continuous  control  of  the 
optical  sensors  and  the  vehicle  guidance  and 
navigation  subsystem  in  order  to  track  a  rapidly 
moving  target.  The  results  demonstrated  the  fast 
forward  recovery  capability  of  the  DRB  scheme 
as  well  as  the  effectiveness  of  the  implementation 
approaches  formulated. 

Another  major  validation  was  conducted  by  a 
small  company  located  in  Los  Angeles,  SoHaR, 
Inc.  They  extended  the  DRB  scheme  for  use  in 
real-time  local  area  PC  networks  for  nuclear 
reactor  control  applications  and  produced  a 
product  prototype  [Hec89,  Hec91].  Figure  3 
depicts  a  high  level  view  of  such  networks. 


2.2 DRB stations in HPM's vs. DRB stations
in LAN's

Since  the  DRB  scheme  is  a  technique  for 
realizing  a  "hardened"  real-time  computing 
station  and  since  both  real-time  computer  systems 
based  on  highly  parallel  multi-computer  networks 
(HPM’s)  and  those  based  on  local  area  networks 
(LAN’s)  can  also  be  structured  in  the  natural  form 
of  interconnections  of  real-time  computing 
stations,  the  application  fields  of  the  DRB  scheme 
cover  both  HPM  based  applications  and  LAN 
based  applications.  On  the  other  hand,  the 
differences  in  interconnection  structures  and 
mechanisms  between  the  HPM's  and  the  LAN's 
can  have  impacts  on  the  approaches  for 
implementation  of  DRB  computing  stations. 

In  LAN  based  systems,  the  inter-node 
communication  costs  are  greater  and  the  costs  of 
providing  redundant  communication  paths  are 
greater.  Therefore,  the  overhead  of  synchronizing 
the  primary  and  shadow  nodes  at  the  beginning  of 
each  task  cycle  as  well  as  the  overhead  for  status 
exchange  is  much  greater  in  LAN  based  systems 
than  in  HPM  based  systems.  Also,  in  some 
HPM's, nodes may be connected via shared
memory  modules  as  is  the  case  in  Figure  2.  In 
such  HPM  based  systems,  data  queues  hosted  on 
shared  memory  modules  serve  as  communication 
media  between  DRB  stations  as  well  as  between 
partner  nodes  belonging  to  the  same  DRB 
stations.  Data  queue  management  should  be  done 
such that the partner nodes not only stay
synchronized to an acceptable degree but also
preserve input data consistency in the sense
that they always pick the same data item or copies
of the same data item at each task cycle [Kim91].
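
One way to picture the input data consistency requirement is a queue that is read by task-cycle number and that retires an item only after every partner has read it. The sketch below is an assumption made for illustration (the class `SharedDataQueue` and its methods are hypothetical and do not reproduce the queue management of [Kim91]); it ignores locking, which a real shared-memory queue would need.

```python
class SharedDataQueue:
    """Queue in which partner nodes read items by task-cycle number."""

    def __init__(self, partners):
        self.partners = tuple(partners)
        self.items = {}      # cycle number -> data item
        self.pending = {}    # cycle number -> partners that have not read it yet
        self.next_cycle = 0

    def put(self, item):
        """Producer side: append the item for the next task cycle."""
        self.items[self.next_cycle] = item
        self.pending[self.next_cycle] = set(self.partners)
        self.next_cycle += 1

    def get(self, node, cycle):
        """Consumer side: every partner asking for `cycle` gets the same item."""
        item = self.items[cycle]             # KeyError if retired or not yet produced
        self.pending[cycle].discard(node)
        if not self.pending[cycle]:          # all partners have read it: retire
            del self.items[cycle]
            del self.pending[cycle]
        return item
```

A partner that runs one cycle ahead still leaves the earlier item in place for its slower partner, so the two nodes can drift by a bounded amount without ever processing different inputs in the same cycle.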

3.  Major  Established  Extensions  of  the 
DRB  Scheme 

Five  major  established  extensions  of  the  DRB 
scheme  are  assessed  in  this  section. 


3.1  DRB  computing  station  based  on 
comparing  processor-pairs 

The  basic  DRB  scheme  relies  on  the  logic 
acceptance  test  and  the  time  acceptance  test  for 
error  detection.  For  faster  detection  of  hardware 
faults,  the  DRB  scheme  can  be  extended  to 
incorporate  various  established  mechanisms.  A 
hardware  fault  detection  scheme  that  has  been 
solidly  established  and  has  met  continuously 
growing  acceptance  due  to  the  improved 
hardware  economy,  is  the  comparing  processor- 
pair  scheme  [Toy78,  Wil85].  This  scheme  has 
been  incorporated  into  commercial  industry 
products  widely  in  use  such  as  AT&T  ESS  and 
Stratus  computers.  An  extension  of  the  DRB 
scheme  under  which  each  of  the  nodes  (primary 
and  shadow)  in  a  DRB  station  is  implemented  in 
the  form  of  a  comparing  processor-pair  is 
depicted in Figure 4. Such a DRB computing
station should exhibit much shorter detection
latency for most hardware faults than the ordinary
DRB station does. Therefore, in the DRB station
shown in Figure 4, only some rare types of
hardware  faults  and  software  faults  will  escape 
the  guards  set  by  the  comparing  processor-pair 
mechanism  and  will  have  to  be  detected  by  the 
acceptance  test  with  concomitant  larger  detection 
latencies. 

One  thing  to  note  here  is  that  faster  detection 
of  hardware  faults  occurring  in  a  node  within  a 
DRB  station  will  be  useful  only  to  the  extent  that 
the  failed  node  can  initiate  sooner  its  attempt  to 
become  a  shadow  node.  However,  another 
important  attractive  feature  of  the  comparing 
processor-pair  mechanism  is  its  high  coverage  in 
detecting  hardware  faults. 
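
The layering of the two detection mechanisms inside one node can be sketched as below. This is an illustrative assumption rather than a description of the AT&T ESS or Stratus hardware: the function name `run_on_pair` and the status strings are hypothetical, and the two "processors" are ordinary callables.

```python
def run_on_pair(proc_a, proc_b, data, acceptance_test):
    """Run one try block on both processors of a comparing pair.

    Returns a (status, result) tuple. A mismatch between the two processors
    is reported immediately; the acceptance test remains as the second line
    of defence for faults that affect both processors identically.
    """
    out_a = proc_a(data)
    out_b = proc_b(data)
    if out_a != out_b:
        return ("pair_mismatch", None)       # fast hardware-fault detection
    if not acceptance_test(out_a):
        return ("acceptance_failed", None)   # e.g. a software fault in both copies
    return ("ok", out_a)
```

A software design fault produces the same wrong value on both processors, so it passes the comparison and is caught only by the acceptance test, which matches the division of labour described above.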


3.2 Multiple recovery blocks in a DRB
computing station

For  reasons  such  as  efficient  node  utilization 
or  special  characteristics  of  the  applications,  often 
multiple  tasks  may  be  designed  to  reside  on  the 
same  node  in  spite  of  the  fact  that  the  single-task- 
per-node  approach  is  becoming  justified  with 
increasing  ease.  This  means  that  multiple 
recovery  blocks  may  reside  in  a  DRB  station 
[Kim91]. Actually, the following three cases are
conceivable. 

(1)  Multi-procedure  DRB  station:  Each  of 
multiple  recovery  blocks  in  the  same  DRB  station 
is  provided  to  process  data  items  from  a  different 
source  (predecessor  computing  station)  or  to 
process  a  different  type  of  data  items.  The 
motivation  for  structuring  this  type  of  DRB 
stations  is  the  node  economy.  The  application 
software of a multi-procedure DRB station thus
takes  the  form  of  a  "case"  statement  enclosing 
multiple  recovery  blocks  as  depicted  in  Figure 
5(a).  A  multi-procedure  DRB  station  is  depicted 
in  Figure  5(b). 

(2)  Multi-phase  DRB  station:  This  case  arises 
where  the  mission  life  of  a  task  running  on  a 
computing  station  consists  of  multiple  phases  and 
different  phases  require  substantially  different 
processing  algorithms.  The  operations  for  each 
phase  can  be  naturally  designed  into  a  separate 
recovery  block.  Although  it  is  possible  to  form  a 
separate DRB station around each recovery block,
it is a wasteful approach since there is no
parallelism among such DRB stations. Therefore,
a multi-phase DRB station can be viewed as one
running a single task structured in the form of a
"case"  statement  enclosing  multiple  recovery 
blocks.  Figure  5(a)  and  5(b)  are  thus  applicable 
to  a  multi-phase  DRB  station  also. 

(3)  DRB  station  with  serially  bonded  recovery 
blocks:  This  DRB  station  contains  multiple 
recovery  blocks  connected  in  a  series  form  as 
shown  in  Figure  6.  Such  recovery  blocks  are 
called serially bonded recovery blocks. This case
naturally  arises  where  a  task  is  required  to  deliver 
its  processing  results  at  several  different  stages, 
possibly  to  different  destinations.  This  DRB 
station  structuring  can  be  motivated  not  only  for 
node  economy  but  also  for  improved  data 
turnaround time. To be more specific, if two
recovery blocks closely related in the form of a
producer-consumer relation are assigned to two
separate DRB stations, then message
communication between recovery blocks involves
inter-node communication. In LAN-based
systems, such inter-node communication delay is
significantly larger than the intra-node
communication delay which would be incurred
when both recovery blocks are assigned to one
DRB station. Therefore, the single DRB station
approach may lead to a shorter data turnaround
time from the input action of the first recovery
block to the output action marking the end of the
second recovery block execution. On the other
hand, the arrival rate of input data for a DRB
station with serially bonded recovery blocks
must be constrained such that the average inter-arrival
time is substantially larger than the
execution time for all the serially bonded
recovery blocks combined together.

The above three types of extended structuring
options are believed to widen the application
fields of the DRB scheme considerably. It is also
important to note that various combinations of the
three options are feasible although detailed
implementation issues and cost-performance
issues need to be studied in the future.
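
The three structuring options can be illustrated from the point of view of a single node. The sketch below is a simplification under stated assumptions: the helper names `recovery_block`, `case_station`, and `serially_bonded` are hypothetical, and the try blocks are executed one after the other on one node, whereas in a DRB station the partner nodes execute different try blocks concurrently.

```python
def recovery_block(try_blocks, acceptance_test):
    """Build a callable that returns the first result passing the acceptance test."""
    def run(data):
        for try_block in try_blocks:
            try:
                result = try_block(data)
            except Exception:
                continue                      # treat an exception as a failed try
            if acceptance_test(result):
                return result
        raise RuntimeError("all try blocks failed")
    return run


def case_station(blocks_by_key):
    """Multi-procedure or multi-phase station: a 'case' over recovery blocks.

    The key is the data source or type (multi-procedure) or the current
    mission phase (multi-phase).
    """
    def run(key, data):
        return blocks_by_key[key](data)
    return run


def serially_bonded(*blocks):
    """Serially bonded blocks: each stage feeds the next and delivers its own result."""
    def run(data):
        outputs = []
        for block in blocks:
            data = block(data)
            outputs.append(data)
        return outputs
    return run
```

The same `case_station` shape serves both options (1) and (2), which mirrors the remark that Figures 5(a) and 5(b) apply to both.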


3.3 N try blocks in a DRB computing station

In some highly safety-critical applications,
the system designer may design more than two try
blocks into a recovery block for the sake of
increased reliability and comfort. Although
several approaches to structuring a DRB station
that uses three try blocks are conceivable, one of
the  most  natural  approaches  is  to  treat  the  third 
node  as  a  shadow  node  for  the  team  of  the  first 
two  nodes.  Such  a  station  is  depicted  in  Figure  7. 

Node  Z  in  the  figure  will  normally  use  try 
block C as its primary try block and deliver its
results only when both X and Y fail to produce
acceptable results in time. Nodes X and Y
behave like a single functional node with respect
to interfacing with their shadow node Z. They
must  share  responsibilities  for  providing  their 
status  information  to  node  Z  at  various  points  as 
well  as  responsibilities  for  understanding  the 
status  of  node  Z.  For  example,  the  type  of  an 
input  data  item  picked,  the  acceptance  test  result 
(an  indication  enabling  node  Z  to  determine  if  any 
one  of  the  two  nodes  X  and  Y  has  passed  its 
acceptance  test),  the  success  of  delivering  the 
result by node X or Y to the successor stations,
etc., are the information that needs to be provided
to node Z.

If node X or Y crashes, then it can be
replaced by node Z, and thus the station can start
functioning as an ordinary two-node DRB station.
Similarly, a crash of node Z will result in the
station functioning as an ordinary two-node DRB
station. If both X and Y fail at their acceptance
tests but are alive, then node Z becomes the new
primary node and one of the two failed nodes X
and Y should become the new secondary node (a
shadow for node Z) and the other the third node
(a shadow for the team of Z and the other node).
The time-out value used by the third node Z,
waiting for status information from the team of X
and Y, can be somewhat larger than that used by
Y monitoring the primary node X.

An  important  advantage  of  the  approach 
depicted  in  Figure  7  is  the  recursive  nature  of  the 
approach. Therefore, in an n-node DRB station,
the n-th node functions as a shadow for the team
of the first n-1 nodes. A natural consequence of
this recursive organization is the modest increase
in the implementation complexity as the number
of nodes used in a DRB station increases.
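
The recursive organization can be written down directly. The sketch below is illustrative only: `station_output` and `monitor_timeouts` are hypothetical names, a node's failure is modelled as `None`, and the linear growth of the time-out values is an assumption (the text says only that a later node's time-out can be "somewhat larger").

```python
def station_output(results, n=None):
    """Output of an n-node DRB station, defined recursively.

    results[i] is node i's accepted result, or None if that node crashed or
    failed its acceptance test. Node n-1 is the shadow for the team formed
    by the first n-1 nodes, so its result is used only if that team fails.
    """
    if n is None:
        n = len(results)
    if n == 0:
        return None
    team = station_output(results, n - 1)
    return team if team is not None else results[n - 1]


def monitor_timeouts(n, base, margin):
    """Assumed time-outs for nodes 2..n, each a little larger than the previous."""
    return [base + i * margin for i in range(n - 1)]
```

Adding a node adds one more level of the same shadow relation rather than a new protocol, which is why the implementation complexity grows only modestly with the number of nodes.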


3.4  Adaptive  DRB  computing  station 

In  some  applications,  environmental 
conditions  that  affect  fault  tolerance  requirements 
imposed  on  computer  systems  change 
dynamically.  As  significant  changes  in 
environmental conditions or in internal computing
resource conditions occur, the set of fault
tolerance mechanisms that are effective also
changes. 

An interesting concept for extending the
DRB scheme is the dynamic switching
between the recovery block scheme and the
DRB scheme in response to changes in
resource constraints [Kim92a, ATTn91]. The
DRB  scheme  requires  more  processing  nodes 
than  the  recovery  block  scheme  but  facilitates 
forward  recovery.  Therefore,  the  recovery 
block scheme can be used in the soft-real-time
mode while the DRB scheme can be
used in the hard-real-time mode. However, if
the number of processing nodes available
falls below a certain threshold while the
system is operating in the hard-real-time
mode, then the system may switch from the
DRB scheme to the recovery block scheme
for  execution  of  selected  tasks.  The  adaptive 
version  of  the  DRB  scheme  which  can  transit 
among  several  modes  of  operation  with 
concomitant  changes  in  resource 
consumption  and  recovery  performance,  is 
called  here  the  adaptive  DRB  scheme. 

Let  us  now  consider  an  approach  to 
facilitating  the  switching  between  the  recovery 
block  scheme  and  the  DRB  scheme.  Figure  8 
depicts  such  an  approach.  As  shown,  a  node 
executes status-1, status-2, check-1*, and status-3
actions  only  when  it  is  operating  as  the  primary 
node  of  a  DRB  station.  Its  mode  of  operation  can 
be  determined  by  checking  if  the  valid  ID  of  the 
shadow-partner  node  is  in  the  relevant  data 
structure.  Similarly,  a  node  can  be  either  in  the 
mode  of  functioning  as  the  shadow  node  of  a 
DRB  station  supporting  the  primary-partner  node 
or  in  the  simplex  mode  of  executing  another  task 
(or  recovery  block)  independently.  When  the 
node  is  ordered  to  be  in  the  latter  mode  by  the 
system  resource  allocator,  the  node  should  also  be 
given  an  instruction  as  to  which  task  (or  recovery 
block)  to  execute. 

Therefore,  when  the  system  resource  allocator 
converts  a  node  executing  a  recovery  block  in  the 
simplex  mode  into  the  primary  node  of  a  DRB 
station,  the  following  steps  are  involved.  The 
system  resource  allocator  first  designates  a  node 
to  become  the  shadow  node  of  the  DRB  station. 
The  allocator  activates  the  shadow  node  first  by 
informing  the  node  of  the  ID  of  its  primary- 
partner  node.  This  activation  may  or  may  not 
involve  an  abortion  of  an  on-going  task 
execution.  Thereafter,  the  allocator  instructs  the 
primary  node  to  supply  the  necessary 
computation state to the shadow-partner and then
start  cooperative  redundant  execution. 
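
The mode check and the conversion steps can be sketched as follows. This is a hedged illustration, not the mechanism of Figure 8: the names `NodeState`, `mode`, and `convert_to_drb` are hypothetical, the system resource allocator is reduced to a single function, and the transfer of computation state is only recorded in a log.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeState:
    node_id: str
    task: Optional[str] = None
    shadow_partner: Optional[str] = None    # valid ID => primary of a DRB station
    primary_partner: Optional[str] = None   # valid ID => shadow of a DRB station

    def mode(self):
        """Derive the mode of operation from the partner IDs on record."""
        if self.shadow_partner is not None:
            return "drb-primary"
        if self.primary_partner is not None:
            return "drb-shadow"
        return "simplex"


def convert_to_drb(primary, candidate, log):
    """Allocator steps for turning a simplex node into a DRB primary."""
    # Step 1: activate the designated shadow by giving it the primary's ID.
    if candidate.task is not None:
        log.append(f"abort {candidate.task} on {candidate.node_id}")
    candidate.task = primary.task
    candidate.primary_partner = primary.node_id
    # Step 2: the primary supplies its computation state and redundant
    # execution starts; here the state transfer is only logged.
    primary.shadow_partner = candidate.node_id
    log.append(f"state {primary.node_id} -> {candidate.node_id}")
```

Activating the shadow before touching the primary means the primary never waits on a partner that is not yet ready to receive its state.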

3.5 Integration of the DRB scheme and
network configuration management schemes

In  order  to  shorten  fault  detection  latency  and 
further  enhance  the  survival  period  of  the  DCS, 
the  DRB  scheme  must  be  integrated  with 
techniques  for  network  configuration 
management  (NCM).  The  NCM  function 
generally  involves  detecting  crashed  nodes, 
whether  they  were  in  busy  (non-idling)  states 
before the crashes or not, and reincorporating
repaired nodes into the operating network
configuration.

The integration of the DRB scheme and a
practical centralized NCM scheme has been
developed by SoHaR, Inc. [Hec89, Hec91]. The
centralized NCM approach has the advantage of
simplicity but can become a single point of
failure.

Several  decentralized  approaches  to  NCM 
have also been studied in recent years [Cri88,
Jen89, Kop89, Kim92b]. Integrations of the DRB
scheme  with  such  decentralized  NCM  schemes 
have  yet  to  be  accomplished. 

Once  a  node  in  a  DRB  station  is  functionally 
or  physically  amputated  off  for  repair,  then  the 
system  resource  allocator  attempts  to  find  a 
replacement node. Such restructuring must be
implemented in an architecture-dependent manner
since efficient synchronization and efficient status
exchange between the partner nodes in a DRB
station are always desirable. Moreover, the issue
of non-disruptive rejoin, i.e., incorporating a new
node into a DRB station and conditioning it into
an active shadow node without disturbing the
primary node much, is a non-trivial one.
Therefore, implementation of a repairable DRB
station is a subject awaiting much further study.

4.  A  Simplified  Application  of  the  DRB 
scheme  to  HPM's 

The  applicability  of  the  DRB  scheme  to  the 
HPM's  for  fault-tolerant  execution  of  real-time 
tasks  was  already  mentioned  in  Section  2.  The 
DRB  scheme  can  be  viewed  as  a  software- 
implemented  approach  for  achieving  fault 
tolerance  in  HPM's  without  requiring  special 
hardware  mechanisms.  An  important  point  here 
is that in applying the DRB scheme to an HPM,
the scheme need not be utilized in its full
generality. To be more specific, if the system
developer is not concerned with possible software
faults, then alternate try blocks are not necessary.
Only  one  algorithm  needs  to  be  designed  for  each 
task. 

Moreover, it is not a requirement to design a
task-specific acceptance test. A common
acceptance test designed to perform spot checks
on a few selected areas of the machine hardware
or integrity checks for various data structures can
be executed at the end of each task to decide
whether to trust the task result as an acceptable
one or not. In such a case where a task-independent
common acceptance test is used and
no  alternate  try  blocks  are  used,  forming  a  DRB 
station  dedicated  to  fault-tolerant  execution  of  a 
task  becomes  a  mechanical  process  which  does 
not  burden  the  application  software  designer  in 
any  way.  This  approach  can  thus  be  viewed  as  a 
concrete  approach  to  mechanical  replicated 
execution  of  real-time  tasks  in  HPM’s.  The 
cooperation  between  the  partner  nodes  follows 
the  same  protocol  discussed  in  Section  2. 

An important issue in designing an HPM
based  real-time  system  is  that  of  mapping 
tasks  to  the  nodes  of  the  HPM.  When  the 
aforementioned  mechanized  formation  of 
DRB  stations  is  used,  task  mapping  involves 
assignment  of  each  task  to  two  nodes  in 
reasonably close proximity [Kim90]. The
distance between partner nodes must be short
because  it  impacts  directly  the  efficiency  of 
synchronization  and  that  of  status  exchange 
between  the  partners  and  thus  impacts  the 
data  turnaround  time  also. 
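
As a rough illustration of proximity-aware task mapping, the sketch below pairs nodes greedily by distance. Everything in it is an assumption made for the example: the HPM is taken to be a 2-D mesh with nodes named by (row, column) coordinates, distance is Manhattan distance, and the greedy strategy in the hypothetical `map_tasks` is not the method of [Kim90].

```python
from itertools import combinations


def manhattan(a, b):
    """Hop distance between two nodes of an assumed 2-D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def map_tasks(tasks, free_nodes):
    """Assign each task a (primary, shadow) pair of the closest free nodes."""
    free = set(free_nodes)
    mapping = {}
    for task in tasks:
        if len(free) < 2:
            raise ValueError("not enough free nodes to form a DRB station")
        # sorted() makes the choice deterministic when several pairs tie
        pair = min(combinations(sorted(free), 2), key=lambda p: manhattan(*p))
        mapping[task] = pair
        free -= set(pair)
    return mapping
```

A greedy choice can leave later tasks with distant partners, so a production mapper would more likely treat this as a matching problem over the whole task set.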

5.  Summary  and  Future  Extension 

The  DRB  scheme  is  a  basic  technology  for 
realizing  a  real-time  fault-tolerant  computing 
station  which  is  a  component  of  a  real-time  DCS 
and  as  such  can  receive  input  data  from  and  send 
computation  results  to  other  computing  stations  in 
the  same  DCS.  It  has  been  evolved  into  a  broadly 
applicable  technology  in  the  past  nine  years  and 
has  been  demonstrated  via  several  testbeds  and 
one  product  prototype.  The  extended  DRB 
schemes  reviewed  in  Section  3  have  significantly 
broader  application  fields  than  the  basic  DRB 
scheme  does.  However,  it  is  fair  to  say  that  the 
DRB  scheme  is  by  and  large  a  technique 
specialized  for  safety-critical  real-time 
applications and not yet a fully matured
technology.  The  following  directions  are 
considered  to  be  among  the  most  important  for 
bringing  the  DRB  technology  to  a  more  mature 
(more  widely  and  easily  practicable)  form. 

(1)  Integration  of  the  DRB  scheme  and 
decentralized  NCM  schemes 

The interesting development by SoHaR, Inc.
on an integration of a centralized NCM scheme
and the DRB scheme was mentioned earlier.
There is a broad range of real-time LAN
applications where decentralized NCM is
desirable. Therefore, there is a need for
establishing efficient decentralized approaches to
NCM and integrating them with the DRB scheme.

(2)  Highly  adaptive  DRB  stations 

A further extension of the integration task
in  (1)  is  to  fully  establish  the  technique  for 
structuring  adaptive  DRB  stations  discussed 
in  Section  3.4.  In  a  highly  adaptive  DRB 
station,  at  least  three  different  modes  of 
operation  are  conceivable:  sequential  backward 
recovery  mode  (recovery  block  scheme), 
concurrent processing forward recovery mode
(DRB  scheme),  and  sequential  forward  recovery 
mode  in  which  a  specially  designed 
application-level  recovery  routine  is  invoked 
without  automatic,  fully  application- 
transparent  rollback  upon  failure  of  the 
primary  routine  to  produce  an  acceptable 
result.  More  importantly,  the  criteria  used  for 
making  decisions  for  switching  among  the  three 
modes  must  be  related  to  resource  conditions 
among  others.  They  need  to  be  established  in 
concrete  forms  and  validated  in  future  research. 

(3)  Integration  with  the  object-based  structuring 
concept 

Object-based  structuring  approaches  are 
meeting  increasing  acceptance  from  system 
designers for reasons such as modularity.
This is the case for both soft-real-time systems
and hard-real-time systems [Kop90]. Adaptation
of  the  DRB  scheme  to  an  object-based  approach 
for  structuring  of  real-time  tasks  is  an  important 
subject  for  future  study. 

(4)  Distributed  conversation  (DCONV)  scheme 

The DRB scheme is applicable to non-interacting
segments (i.e., atomic tasks) of
application  processes.  To  put  it  another  way,  it  is 
a  scheme  to  prevent  a  fault  from  crossing  the 
boundaries  between  real-time  processes  as  much 
as  possible.  For  protecting  against  faults  leaking 
through  the  guards  established  by  the  DRB 
scheme,  supplementary  schemes  are  needed.  A 
promising  case  of  a  supplementary  scheme  is  the 
distributed  conversation  (DCONV)  scheme 
[Kim89b] which is essentially a combination of
the conversation structuring scheme [Ran75] and
the approach of concurrent execution of
redundant software components which was
exploited  in  the  DRB  scheme.  In  a  sense,  the 
DCONV  scheme  can  be  viewed  as  an  approach  to 
hardening  a  group  of  interacting  computing 
stations.  The  scheme  is  capable  of  achieving 
forward  recovery  when  a  part  or  all  of  a  group  of 
computing  stations  fail.  The  research  in  this 
scheme is, however, in its early stage.

Figure 1. The basic DRB computing station
(Adapted from [Kim89a])

Figure 3. The DRB-based fault-tolerant LAN architecture developed by SoHaR, Inc.

Figure 6. Serially bonded recovery blocks in a node within a DRB station

Figure 8. An adaptive DRB station: Conditional activation of the DRB
(Adapted from [Kim92a])

Abstract

The single fault-type models employed in designing dependable systems usually provide either
overly-optimistic or overly-pessimistic assessments of fault coverage or reliability. The
mixed fault-type Hybrid Fault Model (HFM) permits more realistic algorithm design and associated
system modeling. The HFM classifies faults in terms of the effects of these faults
on system operations. The set of all faults is partitioned into three disjoint categories: non-malicious
faults, malicious symmetric faults, and malicious asymmetric faults. Then, the type
of  algorithm  required  to  detect  or  mask  the  subset  of  faults  that  is  assumed  to  occur  is  indicated 
as  a  function  of  the  fault  type  and  a  closed  form  expression  for  system  reliability  is  provided. 
Reliability  estimates  for  the  hybrid  model  are  then  compared  to  those  for  existing  models,  and 
the  impact  of  both  models  on  system  design  decisions  is  assessed. 


1  Introduction 

The  fault  models  used  in  designing  dependable  distributed  systems  typically  make  simplifying 
assumptions about the natures of faults^1 in the system. Often, the fault tolerance algorithms
employed  by  a  system  treat  all  faults  identically,  ignoring  the  effects  of  any  fault  types  the  algorithm 
is  not  designed  to  distinguish  or  to  tolerate.  Such  overly-optimistic  single  fault-type  models  assume 
a  fixed  number  of  benign  permanent  faults  and  perfect  fault  coverage.  Or,  the  system  model  employs 
complex  protocols  that  assume  all  faults  to  be  pernicious,  even  though  only  a  small  portion  of  the 
faults  may  actually  require  such  protection.  By  distinguishing  different  fault  types  and  considering 
varying  probabilities  of  occurrence  of  each  fault  type,  we  can  develop  more  realistic  system  models 
to design algorithms capable of handling the various fault types.

We  have  previously  defined  the  HFM  and  its  impact  on  the  reliability  modeling  of  ultra-reliable 
systems  [1,  2,  3].  In  this  paper,  we  examine  the  dependability  and  fault  resiliency  of  several 
distributed  system  paradigms  under  the  HFM,  using  the  classical  single-fault  models  as  a  basis  for 
comparison.  Under  the  HFM,  the  set  of  all  faults  is  partitioned  into  three  disjoint  classes  based 
on  fault  effects:  non-malicious,  malicious  symmetric,  and  malicious  asymmetric.  Then,  the  type 
of algorithm required to detect or mask the subset of faults that is assumed to occur is indicated
as a function of the fault type. This matching of fault type to algorithm is important in ensuring
adequate, yet cost effective, system fault coverage. If the fault tolerance techniques implemented
do not support segregation and handling of mixed faults, then the hybrid fault model reverts to
the overly optimistic or pessimistic single fault-type models, with no improvement.

*Supported in part by ONR Contract # N00014-91-C-0014

^1 A fault is the identified or hypothesised cause of an error. An error is the manifestation of a fault, an undesired
state either at the boundary or at an internal point in the system or process. A failure is the inability of the system
or component to provide the specified service caused by an error.

After  providing  the  motivation  for  our  current  work,  we  define  the  hybrid  fault  taxonomy  on 
which  the  HFM  is  based.  Next,  the  classical  single-type  fault  models  and  their  associated  reliability 
expressions  are  presented  in  the  context  of  our  hybrid  taxonomy.  After  an  overview  of  the  HFM  and 
the  associated  reliability  expressions,  we  define  our  dependable  system  framework.  We  next  define 
several system characteristics, compare systems under both single-fault and hybrid fault models,
and  provide  fault  tolerance  strategies  appropriate  to  a  variety  of  applications.  After  applying  both 
fault  models  to  example  system  paradigms,  we  conclude  by  summarizing  our  results  and  discussing 
future  research  plans. 

2  Motivation 

Several  taxonomies  have  been  proposed  that  provide  the  fault  characteristics  assumed  by  system 
fault and reliability models [4, 5, 6]. Characteristics such as duration (permanent, transient, intermittent);
nature (hardware, software); behavior (arbitrary, restricted); and count (single, multiple)
have long been used to model the assumed fault effects in computing or estimating system reliability
[4, 7, 8]. With the exception of [9, 10, 11], the models also invariably focus on a single fault
type. If the possibility of arbitrarily malicious or Byzantine faults [12] is considered, many fault
tolerance algorithms and the resulting reliability estimates treat all faults as potentially Byzantine.
Toleration of such faults requires complex, communication-intensive protocols [12, 13, 14, 15],
designed to restrict the malice of faults that can be introduced into the communication process.
Thus, the arbitrarily malicious behavior of faulty nodes is prevented from disrupting the operation
of non-faulty ones. Other fault tolerance strategies ignore fault types assumed to have low occurrence
probabilities, such as Byzantine, and then adjust the reliability estimates using coverage
factors  [4,  7]. 

In  defining  the  HFM,  we  assume  a  fully  connected  system  consisting  of  nodes  which  communicate 
using  synchronous  message  passing,  with  an  upper  bound  on  the  time  required  for  a  node  to  generate 
and  send  a  message.  Individual  nodes  make  decisions  and  compute  values  based  on  information 
received  in  messages  from  other  nodes.  The  status  of  a  node,  faulty  or  good,  is  discerned  by  other 
nodes  through  the  contents  of  messages  originating  from  the  target  node,  or  through  the  lack  of 
an  expected  message  from  that  node.  As  in  [9]  and  [16],  a  non-faulty  node  can  always  identify  the 
sender  of  a  message  it  receives  and  can  detect  the  absence  of  an  expected  message. 

3  The  Hybrid  Fault  Taxonomy 

The  hybrid  fault  taxonomy,  based  on  our  work  in  [9],  [11],  and  on  the  following  definitions,  classifies 
faults according to the errors they cause and the techniques needed to tolerate those errors.^2

The scope of a fault refers to the portion of the system affected by that fault, also called the
fault extent. A symmetric fault generates errors that are manifested identically throughout the
fault scope. An asymmetric fault generates errors that are manifested differently throughout the
fault scope. Asymmetric faults are potentially more difficult to tolerate than symmetric faults.

^2 Although we use a different definition of fault malice, these fault classes are equivalent to the classes of the same
name in [9].

Active redundancy techniques attempt to achieve fault-tolerance through fault-detection, alone
or in conjunction with location and recovery. Since we are dealing with static redundancy management,
no location or recovery techniques are addressed. Passive redundancy techniques use fault
masking to hide the occurrences of faults and to eliminate the effects of faults, thus avoiding errors.
For further details, see [4]. Non-iterative passive redundancy techniques require a single round of
message exchange. Iterative fault masking techniques include procedures such as interactive convergence
and interactive consistency, requiring additional rounds or iterations of message exchange
among participants [13, 14, 16]. Fault-tolerant voting techniques, such as majority and median, are
non-iterative  passive  redundancy  primitives  on  which  iterative  passive  redundancy  techniques  are 
often  based. 
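
The two voting primitives named above can be sketched in a few lines. This is a generic illustration under stated assumptions, not code from the paper: the function names `majority_vote` and `median_vote` are hypothetical, each takes the values received in one round of message exchange, and `median_vote` expects a non-empty list of comparable values.

```python
from collections import Counter
from statistics import median_low


def majority_vote(values):
    """Return the value held by a strict majority of participants, else None."""
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) // 2 else None


def median_vote(values):
    """Return the low median, which is always one of the received values."""
    return median_low(values)
```

With three participants, one arbitrarily wrong value is masked by either voter: it cannot form a majority, and it cannot be the median unless it lies between the two correct values.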

Non-malicious faults can and will be detected^3 in a non-faulty node by the active redundancy
techniques implemented in that node. Malicious faults are those faults that cannot be detected
by the implemented active redundancy techniques, but require masking using passive redundancy
techniques. 

Combining the attributes of fault malice and symmetry produces four mutually exclusive and
collectively exhaustive fault sets: non-malicious symmetric faults ($B_S$), non-malicious asymmetric
faults ($B_A$), malicious symmetric faults ($S$), and malicious asymmetric faults ($A$). The worst-case
(most severe‡ or most difficult to detect or tolerate) faults are those in $A$, corresponding to the
classic Byzantine fault where a faulty node supplies at least two different, correctly framed, values
to different nodes. Faults in $S$ are less severe than the faults in $A$, but are more severe than
faults in $B_A \cup B_S$. Faults in $B_S$ and $B_A$ are comparable in severity, including benign faults, trash
faults, and the subset of Byzantine faults that can be detected using active redundancy techniques,
such as framing errors or missing messages. While the hybrid fault model, presented in §5, does
not partition non-malicious faults into asymmetric and symmetric subsets, the single fault effects
assumed in the classical models described below require this distinction.
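The two attributes of the taxonomy can be read as a two-way lookup. The sketch below is only an illustration of that reading: the names `FaultClass` and `classify` and the boolean encoding are assumptions made here, not part of the taxonomy in [9] or [11].

```python
from enum import Enum


class FaultClass(Enum):
    # Labels follow the four fault sets named in the text.
    B_S = "non-malicious symmetric"
    B_A = "non-malicious asymmetric"
    S = "malicious symmetric"
    A = "malicious asymmetric"


def classify(detectable: bool, symmetric: bool) -> FaultClass:
    """Map the two taxonomy attributes to one of the four fault sets.

    detectable: True if the active redundancy checks implemented in a
                non-faulty node catch the fault (the fault is non-malicious).
    symmetric:  True if the errors appear identically throughout the
                fault scope.
    """
    if detectable:
        return FaultClass.B_S if symmetric else FaultClass.B_A
    return FaultClass.S if symmetric else FaultClass.A
```

For instance, a correctly framed message carrying different values to different receivers is neither detectable nor symmetric, so `classify(False, False)` yields `FaultClass.A`, the Byzantine case.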

4  Classical  Fault  Models  and  Reliability 

Many systems fail to state their fault assumptions explicitly. Instead, they assume perfect fault
coverage, even though the fault tolerance techniques they employ, which implicitly define an
assumed fault model, may not be resilient to all fault types. A system using only active redundancy
techniques is capable of detecting non-malicious or benign faults from set ($\mathbf{B} = B_S \cup B_A$). However,
if an (uncovered) malicious fault from set ($A \cup S$) occurs, system failure is likely to occur. When
only non-iterative passive redundancy techniques, such as majority or fault-tolerant midpoint votes,
are implemented, symmetric faults from the set ($\mathbf{S} = B_S \cup S$) are masked, but the occurrence of
asymmetric faults from set ($\mathbf{A} = B_A \cup A$) can cause the system to fail. The use of interactive
consistency and interactive convergence algorithms ensures all fault types are covered, since such
algorithms mask arbitrary faults.

The assumed system dependability requirement is the ability to compute a correct result in the
presence of faults, and we assume static redundancy management. Thus, combinatorial formulas
are sufficient to provide reliability estimates. Identical node reliability, $R(t)$, is assumed, and
no assumptions are made about node failure distributions. The reliability is a function of $m$, the
minimum number of good nodes required in a set of $N$ nodes to maintain system operation, where
$R_{m\,\mathrm{of}\,N}(t)$ is given by

\[ R_{m\,\mathrm{of}\,N}(t) = \sum_{i=0}^{N-m} \binom{N}{i} (1 - R(t))^{i} R(t)^{N-i} \]

and $m$ is a function of the covered fault set.

Based on this discussion, we next define the classical fault scenarios $C_B$, $C_S$, and $C_A$, where the
subscript is indicative of the faults covered by the model.¶

$C_B$: Using active redundancy algorithms, a minimum of $n_B$ nodes is needed to cover $f_B$ faults in
$\mathbf{B}$, where $n_B = f_B + 1$. System reliability is given by the expression above with $m = m_B$.

$C_S$: Using non-iterative passive redundancy algorithms, a minimum of $n_S$ nodes is needed to mask
$f_S$ faults in $\mathbf{S}$, where $n_S = 2f_S + 1$. Reliability is given by the expression above with $m = m_S$.

$C_A$: Using iterative passive redundancy algorithms, a minimum of $n_A$ nodes is needed to mask
$f_A$ arbitrary faults, where $n_A = 3f_A + 1$. The reliability is given by the expression above with
$m = m_A$.

The impact of the classical scenarios on system design is addressed following the definition of
the hybrid fault model.

† By definition, a fault that is undetected by the active redundancy techniques implemented in a
non-faulty node is malicious.

‡ Severity is subjective, relative to the context of the system model.
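The m-of-N expression for the classical scenarios can be evaluated directly. The following is a minimal sketch under stated assumptions: the function names are invented here, the node failure probability over the mission is taken as $1 - e^{-10^{-4}}$, and $m$ is taken as $N - f$ with $f$ the number of faults the scenario covers for that $N$. None of these choices is given explicitly at this point in the text.

```python
from math import comb, exp


def m_of_n_reliability(n: int, m: int, r: float) -> float:
    """Probability that at least m of n identical nodes are good."""
    q = 1.0 - r
    # i counts failed nodes; the system survives up to n - m failures.
    return sum(comb(n, i) * q**i * r ** (n - i) for i in range(n - m + 1))


def m_of_n_unreliability(n: int, m: int, r: float) -> float:
    """Complement of the above, summed over failed states directly
    so that very small values are not lost to cancellation."""
    q = 1.0 - r
    return sum(comb(n, i) * q**i * r ** (n - i) for i in range(n - m + 1, n + 1))


# Assumed example: majority voting on 3 nodes masks f = 1 fault, so m = 2.
node_r = exp(-1e-4)
u_cs_3 = m_of_n_unreliability(3, 2, node_r)
```

Under these assumptions `u_cs_3` comes out close to 3.0E-8, the same order as the three-node $C_S$ entry reported in Table 6.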


5  The  Hybrid  Fault  Model 

The hybrid fault model comprises three scenarios based on the worst case faults covered in each
scenario. Sets $A$ and $S$ are as defined in §3, while the set of non-malicious faults is given by
$B = B_A \cup B_S$. By definition, the sets $B$, $S$, and $A$ are disjoint. Since experimental evidence
suggests that faults in $B$ are the most common, with faults in $S$ less common than those in $B$, and
faults in $A$ the least common of all, the fault assumptions made in a given system can be used to
evaluate the impact of implementing the different HFM scenarios presented below.

5.1  Hybrid  Fault  Scenarios 

The key to the hybrid fault model is to associate the proper hybrid algorithm with the assumed
system or node fault set. The notation $H_X$ is used to indicate that the scenario assumes that the
worst case faults are in set $X$, where $X \in \{B, S, A\}$.

$H_B$: Faults in $\mathcal{F}_B = B$ are covered. Hybrid active redundancy algorithms and at least $n_B$ nodes
are required to tolerate $f_B$ non-malicious faults, where $n_B = f_B + (\tau_B + 1)$. The parameter $\tau_B$
is a fixed index, dependent upon the desired fault coverage, where $(1 + \tau_B)$ is the minimum
number of nodes required for the system to remain operational.

$H_S$: Faults in $\mathcal{F}_S = B \cup S$ are covered using hybrid non-iterative passive redundancy algorithms.
At least $n_S = (f_B + f_S) + (\tau_S + 1)$ processes are needed to tolerate $(f_B + f_S)$ faults, where
$f_S \le S_{max}$. If operation in the presence of only one non-faulty node is possible, then
$\tau_S = S_{max}$. Otherwise, $\tau_S > S_{max}$ if at least $(\tau_S + 1)$ good nodes are required.

$H_A$: The fault set is $\mathcal{F}_A = B \cup S \cup A$, so all possible faults are covered. Failure of the algorithm
to tolerate any fault corresponds to a failure of the node running the algorithm. A minimum
of $n_A = (2f_A + 2f_S + f_B + \tau_A + 1)$ nodes is sufficient to tolerate $(f_A + f_B + f_S)$ faults. The
maximum number of faults in $A$ that can be tolerated is $A_{max} = \lfloor (N-1)/3 \rfloor$, with $f_A \le A_{max}$
and at least $(\tau_A + 1)$ good nodes assumed to be necessary for the system to remain
operational. If a hybrid interactive consistency algorithm with $r$ rounds of rebroadcast is used,
then the further restriction of $f_A \le r$ is also necessary.

¶ The subscripts B, S, and A, referring to benign, symmetric, and asymmetric faults, should not be
confused with the sets $B$, $S$, and $A$ of the hybrid fault taxonomy. Although the set $\mathbf{B}$ is equivalent
to the set $B = B_A \cup B_S$ as defined in §5, $\mathbf{A} \ne A$, $\mathbf{S} \ne S$, and sets $\mathbf{B}$, $\mathbf{S}$, and $\mathbf{A}$ are not disjoint.
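The node-count condition for scenario $H_A$ can be turned into a simple predicate on a fault triple. The sketch below is an illustration only: the function name `h_a_covers` is hypothetical, and the default $\tau_A = \lfloor (N-1)/3 \rfloor$ is an assumption that matches the nominal value used later in §6.2.

```python
from typing import Optional


def h_a_covers(n: int, f_b: int, f_s: int, f_a: int,
               tau_a: Optional[int] = None) -> bool:
    """True if n nodes suffice for the fault triple (f_b, f_s, f_a) under H_A."""
    a_max = (n - 1) // 3          # most asymmetric malicious faults tolerable
    if tau_a is None:
        tau_a = a_max             # assumed nominal choice of the index
    enough_nodes = n >= 2 * f_a + 2 * f_s + f_b + tau_a + 1
    return f_a <= a_max and enough_nodes


# Triples covered by a four-node system with at most two faults in total.
covered_n4 = sorted(
    (b, s, a)
    for b in range(3) for s in range(3) for a in range(3)
    if 0 < b + s + a <= 2 and h_a_covers(4, b, s, a)
)
```

With four nodes the predicate accepts up to two benign faults, or one symmetric malicious fault, or one asymmetric malicious fault, and rejects any mixture of two faults that includes a malicious one.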

The hybrid fault tolerance algorithms required by the HFM first apply an active redundancy
technique to each message or value received by a node to discern any non-malicious faults, using,
for example, sanity checks, formatting checks, and error detection and/or correction codes. If a
non-malicious fault is detected, such as a framing, parity, or encoding fault, a missing message, or
a range violation, then a default error or status value is adopted as the value received by the node
in the message. We can also assume perfect detection of non-malicious faults because any fault not
detected by the active redundancy techniques implemented in the node is malicious by definition.

Next, passive redundancy techniques appropriate to the application are applied to the remaining
values received by good nodes. We modify existing passive redundancy techniques to ignore or
exclude the default error or status value from any calculations or comparisons. In the absence of
non-malicious faults, no elements are excluded. Hybrid voting functions are derived in [1, 3] from
the median, majority, and t-fault-tolerant mean and midpoint functions used in non-iterative
passive redundancy algorithms. A hybrid interactive consistency algorithm is presented in [9]. It
should be noted that the exclusion of error values and the abilities of different nodes to receive
different numbers of error values may result in a decrease in the number of values presented to
the aforementioned fault-tolerant voting functions. Thus, the degree of fault tolerance of hybrid
passive redundancy algorithms may need to be adjusted dynamically, as shown in [3] for a hybrid
interactive convergence algorithm.
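The idea of excluding the default error value before voting can be illustrated with a small majority function. This is a sketch, not the hybrid voting functions derived in [1, 3]: the sentinel `ERROR`, the function name, and the strict-majority rule over the remaining values are all assumptions made for the example.

```python
from collections import Counter

ERROR = object()  # hypothetical sentinel for the default error or status value


def hybrid_majority(values):
    """Majority vote over received values, ignoring ERROR entries.

    Returns the value held by a strict majority of the non-error entries,
    or ERROR when no such value exists. Values must be hashable.
    """
    kept = [v for v in values if v is not ERROR]
    if not kept:
        return ERROR
    value, count = Counter(kept).most_common(1)[0]
    # The threshold adapts to however many values survived the exclusion.
    return value if 2 * count > len(kept) else ERROR
```

Because the threshold is computed from the surviving values, a node that received fewer usable values still votes, which reflects the remark that the degree of fault tolerance may need to be adjusted dynamically.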

5.2  HFM  Scenario  Reliability 

Since the assumed system dependability requirement is the ability to compute a correct result in
the presence of faults, and static redundancy management is assumed, combinatorial formulas are
sufficient to provide reliability estimates. The expressions for reliability under the HFM are more
complex than those for the classical fault scenarios, as they are based upon the probabilities of
occurrence of mixed fault types. The combinatorial formulas stated below are derived in [1, 11].
Again, identical node reliability, $R(t)$, is assumed, with no assumptions regarding the distribution
of node failures.

Reliability under the three HFM scenarios can be estimated by considering a system's
operational states under combinations of mixed faults. The conditional probabilities of occurrence of
type $A$, $S$ and $B$ faults, given that a fault has occurred, are given by $\mu_A$, $\mu_S$, and $\mu_B$, where
$\mu_A + \mu_S + \mu_B = 1$. Typically, we also have $\mu_B \gg \mu_S \gg \mu_A$.
Scenario $H_A$ Reliability. By definition, the number of nodes, $N$, satisfies

\[ N \ge 2f_A + 2f_S + f_B + \tau_A + 1. \]

The system reliability is then computed by summing over all possible operational states according
to $H_A$ as follows:

\[ R_{H_A}(t) = \sum_{b=0}^{B} \sum_{s=0}^{S} \sum_{a=0}^{A} \frac{N!}{a!\,s!\,b!\,(N-a-s-b)!}
   \bigl(\mu_A (1 - R(t))\bigr)^{a} \bigl(\mu_S (1 - R(t))\bigr)^{s} \bigl(\mu_B (1 - R(t))\bigr)^{b} R(t)^{N-a-s-b} \qquad (1) \]

where $B = N - \tau_A - 1$, $S = \lfloor (N - \tau_A - b - 1)/2 \rfloor$, and
$A = \min\{\tau_A, \lfloor (N - \tau_A - b - 2s - 1)/2 \rfloor\}$.

For each operating state, the triple $(a, s, b)$ corresponds to the triple $(f_A, f_S, f_B)$, indicating the
number of each type of fault occurring in that state. By varying the value of $\tau_A$, this model covers
both interactive convergence and interactive consistency algorithms. Equation (1) is equivalent to
the expression given in [11] with $\tau_A = r$ under the assumption of a hybrid interactive consistency
algorithm.
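A numerical check of the $H_A$ state sum is straightforward. The sketch below computes the complement (the probability of leaving the covered states) by summing multinomial state probabilities. It rests on assumptions not spelled out at this point: the function name is invented, each state is weighted by a multinomial term in $\mu_A$, $\mu_S$, $\mu_B$ and the node reliability, the nominal $\tau_A = \lfloor (N-1)/3 \rfloor$ is used by default, and the node reliability is taken as $e^{-10^{-4}}$.

```python
from math import exp, factorial


def h_a_unreliability(n, mu_b, mu_s, mu_a, r, tau_a=None):
    """Probability mass of the fault states that H_A does not cover."""
    q = 1.0 - r
    if tau_a is None:
        tau_a = (n - 1) // 3      # assumed nominal index
    total = 0.0
    for b in range(n + 1):
        for s in range(n - b + 1):
            for a in range(n - b - s + 1):
                covered = a <= tau_a and n >= 2 * a + 2 * s + b + tau_a + 1
                if covered:
                    continue
                good = n - a - s - b
                ways = factorial(n) // (
                    factorial(a) * factorial(s) * factorial(b) * factorial(good))
                total += (ways * (mu_a * q) ** a * (mu_s * q) ** s
                          * (mu_b * q) ** b * r ** good)
    return total


node_r = exp(-1e-4)
u_n4 = h_a_unreliability(4, 0.98, 0.01, 0.01, node_r)
u_n5 = h_a_unreliability(5, 0.98, 0.01, 0.01, node_r)
```

Under these assumptions the four-node and five-node results are close to 2.4E-9 and 4.1E-11, which is the magnitude listed for $H_A$ under Strategy 2 in Table 6.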


Scenario $H_S$ Reliability. By the definition of $H_S$, $N \ge f_S + f_B + \tau_S + 1$. If perfect fault
coverage is assumed, the conditional probability of an asymmetric fault is taken to be zero, and we
have $\mu_S + \mu_B = 1$. However, by assuming that $\mu_S$ and $\mu_B$ do not sum to one, i.e., $1 - (\mu_S + \mu_B) = \gamma_S$
for $\gamma_S > 0$, the probability of failure due to an uncovered fault can be included in the reliability
computation. However, since the probability of system failure in the presence of a single fault in $A$
is unity, no operating state can sustain an asymmetric malicious fault. So, regardless of the value
of $\mu_A$, we again sum over all the operating states to yield $R_{H_S}(t)$, given by

\[ R_{H_S}(t) = \sum_{b=0}^{B} \sum_{s=0}^{S} \frac{N!}{s!\,b!\,(N-s-b)!}
   \bigl(\mu_S (1 - R(t))\bigr)^{s} \bigl(\mu_B (1 - R(t))\bigr)^{b} R(t)^{N-s-b} \]

where $B = N - \tau_S - 1$ and $S = \min(\tau_S, N - \tau_S - b - 1)$.
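The effect of an uncovered fault probability on scenario $H_S$ can be explored numerically. The sketch below is hedged in two ways. First, the function name and parameters are invented for illustration. Second, the coverage rule it uses, $N \ge 2f_S + f_B + 1$, is an assumption read off the $H_S$ column of Table 1 rather than the summation limits stated above, so it should be treated as one possible interpretation.

```python
from math import exp, factorial


def h_s_unreliability(n, mu_b, mu_s, gamma, r):
    """Probability that an n-node system leaves the covered H_S states.

    mu_b, mu_s: conditional probabilities of benign and symmetric faults.
    gamma:      conditional probability of an uncovered fault.
    r:          reliability of a single node over the mission.
    """
    q = 1.0 - r
    total = 0.0
    for b in range(n + 1):
        for s in range(n - b + 1):
            for u in range(n - b - s + 1):   # u = number of uncovered faults
                # Assumed coverage rule; any uncovered fault is fatal.
                covered = u == 0 and n >= 2 * s + b + 1
                if covered:
                    continue
                good = n - b - s - u
                ways = factorial(n) // (
                    factorial(b) * factorial(s) * factorial(u) * factorial(good))
                total += (ways * (mu_b * q) ** b * (mu_s * q) ** s
                          * (gamma * q) ** u * r ** good)
    return total


node_r = exp(-1e-4)
perfect = h_s_unreliability(3, 0.98, 0.02, 0.0, node_r)
practical = h_s_unreliability(4, 0.98, 0.015, 0.005, node_r)
```

With a node reliability of $e^{-10^{-4}}$ (also an assumption here), the perfect-coverage value for three nodes is near 1.2E-9 and the four-node value with $\gamma_S = .005$ is near 2.0E-6, showing how a small uncovered fraction dominates the estimate.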

Scenario $H_B$ Reliability. We have $N \ge f_B + \tau_B + 1$, as defined in §5. If perfect coverage
is assumed, then $\mu_B = 1$, with $\mu_S = \mu_A = 0$; otherwise, $1 - \mu_B = \gamma_B$ for some $\gamma_B > 0$, and
the probability of correct operation in the presence of either a malicious symmetric or malicious
asymmetric fault is zero. Thus, reliability under this scenario is given by

\[ R_{H_B}(t) = \sum_{b=0}^{B} \binom{N}{b} \bigl(\mu_B (1 - R(t))\bigr)^{b} R(t)^{N-b} \]

where $B = N - \tau_B - 1$.

6  System  Fault  Tolerance  Strategies 

As  stated  previously,  we  assume  static  redundancy  management,  where  faulty  nodes  remain  in  the 
system;  neither  fault  isolation  nor  reconfiguration  is  considered.  Node  failures  are  assumed  to  be 
exponentially distributed with failure rate $\lambda = 10^{-4}$ over a one hour mission.
N        2    3    4    5    6    7    8
$C_B$    1    2    3    4    5    6    7
$C_S$    -    1    1    2    2    3    3
$C_A$    -    -    1    1    1    2    2

$H_S$, covered fault combinations $(f_B, f_S, f_A)$:
N = 3:  (≤2,0,0)  (0,1,0)
N = 4:  (≤3,0,0)  (1,1,0)  (0,1,0)
N = 5:  (≤4,0,0)  (≤2,1,0)  (0,≤2,0)
N = 6:  (≤5,0,0)  (≤3,1,0)  (1,2,0)
N = 7:  (≤6,0,0)  (≤4,1,0)  (2,2,0)  (1,2,0)
N = 8:  (≤7,0,0)  (≤5,1,0)  (≤3,2,0)  (1,3,0)  (0,≤3,0)

$H_A$, covered fault combinations $(f_B, f_S, f_A)$:
N = 4:  (≤2,0,0)  (0,0,1)  (0,1,0)
N = 5:  (≤3,0,0)  (1,1,0)  (1,0,1)  (0,1,0)  (0,0,1)
N = 6:  (≤4,0,0)  (≤2,1,0)  (≤2,0,1)  (0,≤2,0)  (0,1,1)
N = 7:  (≤4,0,0)  (≤2,1,0)  (≤2,0,1)  (0,≤2,0)  (0,1,1)
N = 8:  (≤5,0,0)  (≤3,1,0)  (≤3,0,1)  (1,2,0)  (1,1,1)  (0,≤2,0)  (0,1,1)  (0,0,≤2)

Table 1: Classical and HFM Covered Faults


The  main  dependability  requirement  is  that  all  non-faulty  nodes  compute  “correct”  values  in 
the  presence  of  faults,  with  the  definition  of  correctness  specified  for  individual  scenarios.  For 
simplicity,  each  node  computes  a  data  value,  sends  it  to  all  other  nodes,  and  decides  on  a  correct 
value.  All  good  nodes  are  expected  to  arrive  at  the  same  value. 

Specific definitions for computing a correct final value determine the scenario that applies to
the system model. In scenarios $C_B$ and $H_B$, a node assumes its own value is correct, and compares
this value to the values received from all other nodes, thus detecting the presence of faulty nodes in
the system. In scenarios $C_S$ and $H_S$, a node applies a fault masking algorithm to the set containing
its personal value and the values received from all other nodes. The node adopts this voted value as
the correct value. In scenarios $C_A$ and $H_A$, each node executes an interactive consistency algorithm
to achieve agreement among the non-faulty nodes upon the values sent by every node. A majority
algorithm is then applied to the consistent value set to arrive at a final correct value.

We next present several design strategies based on this simple framework to further illustrate
the implications of the HFM. These strategies are derived from the techniques described above for
computing a correct final value under each scenario. The fault detection and masking algorithms
used in the scenario are further specified to enhance the system tolerance to mixed faults. Several
different assumptions about the types of faults and their relative probabilities are made. The
characteristics of systems defined under the various strategies are then compared and contrasted
by applying them to several examples in §7.

6.1  Strategy  1:  Classical  Single  Fault  Type  Models 

Our baseline strategy employs the basic single fault type models defined in §4. Table 2 presents
the characteristics associated with each of the scenarios, with the number of faults tolerated given
in Table 1 and the reliability given in Table 6 for all three scenarios for several values of $N$. The
main difficulties in using this strategy are the assumption of perfect fault coverage and the use of
a single type of fault tolerance algorithm, implementing either passive or active redundancy.

Scenario     $C_B$               $C_S$             $C_A$
Fault Set    $B_S \cup B_A$      $B_S \cup S$      $B_A \cup A$
FT Tech      Detection           Masking           Masking
Active       Y                   None              None
Passive      None                Majority          IC
Fault Prob   $\mu_B = 1$         $\mu_S = 1$       $\mu_A = 1$

Table 2: Strategy 1, Classical Models

We begin by examining $C_B$, as shown in Table 6 for up to 8 nodes. The reliability estimates
obtained under this assumption are overly optimistic. Under this model perfect reliability is achieved
for five or more nodes, even though, realistically, it may be impossible to guarantee perfect fault
coverage. Furthermore, computed values are merely compared and a detected difference is flagged.
This model neglects the occurrence of malicious faults, and cannot guarantee that all good nodes
will be able to recognize a correct value.

If  scenario  Cs  applies,  symmetric  faults  can  be  masked,  and  each  good  node  will  compute  a 
correct  value  as  long  as  the  fault  assumptions  (type  and  number)  are  not  violated.  However,  the 
implementation  of  a  masking  algorithm  without  dedicating  additional  resources  to  fault  detection 
removes  the  system’s  ability  to  detect  faults.  Thus,  if  an  uncovered  fault  occurs,  the  good  nodes 
could  potentially  compute  incorrect  values  with  no  indication  of  any  fault.  Again,  the  reliability 
estimates  for  Cs,  shown  in  Table  6,  are  overly  optimistic. 

Unlike previous scenario estimates, the reliability estimate under scenario $C_A$ is overly
pessimistic, as it assumes that all faults are arbitrarily malicious. Perfect coverage to the number of
faults  shown  in  Table  1  permits  the  system  to  tolerate  all  types  of  faults.  The  implementation  of 
interactive  consistency  or  convergence  fault  masking  algorithms  permits  the  correct  answer  to  be 
computed  on  all  good  nodes.  However,  faulty  nodes  are  not  detected.  The  effects  of  this  model  on 
system  reliability  are  shown  in  Table  6. 

6.2  Strategy  2 — HFM  with  Perfect  Coverage 

We begin our discussion of the HFM under the assumption of perfect coverage. The specific
characteristics of each scenario under the HFM are given in Table 3. Unlike the previous strategy,
mixed fault tolerance techniques are employed, permitting detection of non-malicious faults in all
scenarios. Masking of malicious faults in scenarios $H_S$ and $H_A$ ensures correct computations under
that fault assumption. Framing checks are used to detect garbled messages, and then hybrid passive
redundancy techniques are used to mask the remaining faults. The nominal values of $\tau$ are assumed,
with $\tau_B = 0$, $\tau_S = S_{max}$, and $\tau_A = A_{max}$. The fault combinations that can occur for scenarios $H_A$
and $H_S$ appear in Table 1. Unreliability estimates for these scenarios under Strategy 2 are
given in Table 6. The unreliability and faults covered for $H_B$ are identical to those given for $C_B$.
Although the possibility of uncovered faults is neglected in scenario $H_S$, the reliability obtained
still exceeds that of the classical fault models, as can be seen by comparing the unreliabilities for
the two scenarios in Table 6. If scenario $H_A$ applies, then the reliability estimates are improved
by at least a factor of 10, as can be seen by examining the entries corresponding to $H_A$ in Table 6
and $C_A$ in Table 6. The fault combinations given in Table 1 for the two scenarios are also valid for
all strategies.

Scenario     $H_B$               $H_S$               $H_A$
Fault Set    $B$                 $B \cup S$          $B \cup S \cup A$
FT Tech      Hybrid Detection    Hybrid Majority     Hybrid IC
Active       Framing, Compare    Framing             Framing
Passive      None                Hyb Majority        Hyb IC
Fault Prob   (1, 0, 0)           (.98, .02, 0)       (.98, .01, .01)

Table 3: Strategy 2, HFM with Perfect Coverage

Scenario     $H_B$               $H_S$                  $H_A$
Fault Set    $B$                 $B \cup S$             $B \cup S \cup A$
FT Tech      Hybrid Detection    Hybrid Majority        Hybrid IC
Active       Framing, Compare    Framing                Framing
Passive      None                Hyb Majority           Hyb IC
Fault Prob   (.98, (.02))        (.98, .015, (.005))    (.98, .015, .005)

Table 4: Strategy 3, HFM with More Practical Fault Coverage

6.3  Strategy  3:  HFM  with  More  Practical  Fault  Coverage 

As with the classical models in Strategy 1, the HFM scenarios in the previous strategy all assume
perfect fault coverage. The reliability estimates reflect this assumption, with scenarios $H_S$ and $H_B$
achieving near-perfect reliability. Suppose the conditional probabilities of non-malicious, symmetric
malicious, and asymmetric malicious faults are assumed to be ($\mu_B = .98$, $\mu_S = .015$, $\mu_A = .005$), as
in Table 4. For scenario $H_B$, the notation (.98, (.02)), used in Table 4, means that the probability of
a covered fault is $\mu_B = .98$, with a probability of .02 that an uncovered fault will occur. The entries
under Strategy 3 in Table 6 contain the unreliability estimates for both $H_B$ and $H_S$, recomputed
to account for the potential of uncovered faults which cause the system to fail, as described in §5.2.
While these estimates may actually be somewhat pessimistic, they provide a lower bound on system
reliability to complement the upper bound computed in the previous strategy. Since $H_A$ covers all
fault types, the corresponding unreliability estimates are identical to those under Strategy 2.
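A back-of-the-envelope check makes the effect of uncovered faults concrete. The estimate below is a first-order approximation introduced here for illustration, not a formula from the text: it assumes that a single uncovered fault in any of the $N$ nodes is the dominant failure state, and that the node failure probability $q = 1 - R(t)$ is about $10^{-4}$ over the mission.

```latex
% First-order estimate (assumption): one uncovered fault among N nodes dominates.
% q is the node failure probability, gamma the uncovered fault probability.
\[
  U \approx N \, q \, \gamma
\]
% Scenario H_B with two nodes and gamma_B = .02
\[
  U_{H_B} \approx 2 \times 10^{-4} \times .02 = 4.0 \times 10^{-6}
\]
% Scenario H_S with four nodes and gamma_S = .005
\[
  U_{H_S} \approx 4 \times 10^{-4} \times .005 = 2.0 \times 10^{-6}
\]
```

Both values agree with the Strategy 3 entries of Table 6 for those node counts, which suggests that the uncovered fault term, growing linearly in $N$, is what dominates these estimates.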

6.4  Strategy 4: HFM with More Fault Detection

A novel feature of the HFM is the ability to transform malicious faults into non-malicious ones by
including more fault detection mechanisms in the system. By definition, malicious faults are those
whose effects cannot be detected by the fault tolerance mechanisms implemented in the system.
So, the inclusion of additional detection methods should decrease the conditional probability that
a fault is malicious by increasing the types of faults that can be detected. The dependence of
fault type on system factors is also a feature of the commonly used duration taxonomy, in that the
distinction among permanent, intermittent and transient faults must be made relative to the
time granularity of the specific application and mission time. However, it is easier to add additional
fault protection than to change the application or mission time parameters. Since the malice of
a fault depends upon the specific fault tolerance techniques implemented, the distinction between
malicious and non-malicious faults cannot be separated from the system design.

Scenario     $H_B$               $H_S$                        $H_A$
Fault Set    $B$                 $B \cup S$                   $B \cup S \cup A$
FT Tech      Hybrid Detection    Hybrid Majority              Hybrid IC
Active       Framing, ECC,       Framing, ECC,                Framing, ECC,
             Sanity, Compare     Sanity, Compare              Sanity, Compare
Passive      None                Hyb Majority                 Hyb IC
Fault Prob   (.99, (.01))        (a) (.99, .01, 0)            (.99, .008, .002)
                                 (b) (.99, .008, (.002))

Table 5: Strategy 4, Hybrid Fault Model with More Active Redundancy

N                       2         3             4         5             6             7             8
Strategy 1  $C_B$       1.0E-8    [illegible]   9.7E-17   0             0             [illegible]   [illegible]
            $C_S$       -         3.0E-8        6.0E-8    1.0E-11       2.0E-11       [illegible]   [illegible]
            $C_A$       -         -             6.0E-8    [illegible]   1.5E-7        3.5E-11       5.6E-11
Strategy 2  $H_S$       -         1.2E-9        2.4E-11   1.2E-14       1.7E-16       1.4E-17       0
            $H_A$       -         -             2.4E-9    4.1E-11       1.5E-11       4.2E-14       4.6E-16
Strategy 3  $H_B$       4.0E-6    6.0E-6        8.0E-6    1.0E-5        1.2E-5        1.4E-5        1.6E-5
            $H_S$       -         1.5E-6        2.0E-6    2.5E-11       2.0E-6        3.0E-6        3.5E-6
            $H_A$       -         -             2.4E-9    4.1E-11       3.8E-12       4.2E-14       4.6E-16
Strategy 4  $H_B$       2.0E-6    3.0E-6        4.0E-6    5.0E-6        [illegible]   7.0E-6        9.0E-6
            $H_S$(a)    -         6.0E-10       6.1E-12   3.0E-15       [illegible]   0             0
            $H_S$(b)    -         6.0E-7        8.0E-7    1.0E-6        1.2E-6        1.4E-6        1.6E-6
            $H_A$       -         -             1.2E-9    1.0E-11       6.1E-13       1.1E-14       2.8E-17

Table 6: Unreliability for All Strategies

Thus, in this strategy, the additional active redundancy techniques employed in the system
should cause the conditional probability of malicious faults to decrease. Instead of relying solely
on framing checks to detect faulty node behavior in the form of garbled messages, error correction
and detection codes can be implemented to permit information redundancy in data transmission to
reduce the probability of an undetectably faulty message. Sanity checks are employed to identify data
values, in properly framed messages, that are outside the range of acceptable values. Comparison of
values received from a node with the value obtained by applying passive redundancy techniques to
a set of received data values permits detection of faults that were malicious in the previous strategy.
Thus, the assumed conditional fault probabilities $\mu_A$, $\mu_S$, and $\mu_B$ can be adjusted to reflect the
additional fault coverage, as shown in Table 5. The resulting increase in reliability is demonstrated
in Table 6 under Strategy 4, where the values for $H_S$ are computed with the perfect fault coverage
assumption (a), and without it (b).
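The layering of checks in this strategy can be sketched as a small pipeline: active checks first, then a vote over the surviving values, then a comparison of each received value against the voted result. Everything concrete in the sketch is an assumption made for illustration: the message layout (a dictionary with `framed`, `parity_ok`, and `value` keys), the use of a parity flag as a stand-in for an ECC check, the median as the voting function, and all function names.

```python
from statistics import median

ERROR = None  # hypothetical stand-in for the default error or status value


def active_checks(msg, lo, hi):
    """Return the payload if framing, code, and sanity checks pass, else ERROR."""
    if not msg.get("framed", False):       # framing check
        return ERROR
    if not msg.get("parity_ok", False):    # stand-in for an ECC check
        return ERROR
    value = msg["value"]
    if not lo <= value <= hi:              # sanity (range) check
        return ERROR
    return value


def vote_and_flag(messages, lo, hi, tol=0.0):
    """Vote over checked values and flag nodes that look faulty.

    Returns (voted value, set of suspect node names).
    """
    values = {node: active_checks(m, lo, hi) for node, m in messages.items()}
    kept = [v for v in values.values() if v is not ERROR]
    if not kept:
        return ERROR, set(values)
    voted = median(kept)
    # Post-vote comparison catches in-range values that disagree with the vote.
    suspects = {n for n, v in values.items()
                if v is ERROR or abs(v - voted) > tol}
    return voted, suspects
```

In this sketch an out-of-range value is removed before the vote by the sanity check, while a plausible but wrong value survives the active checks and is only flagged by the comparison step after voting.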

We  next  apply  these  strategies  to  two  system  design  problems. 
7  HFM Application to System Design

As evident from the strategies presented in the previous section, the classical single type fault
scenarios contain few parameters that can be varied: the number of nodes, the number of rebroadcast
rounds in the interactive consistency algorithm, and the number of good nodes required for system
operation. The hybrid fault scenarios expand the parameters, permitting improved precision in
modeling the system specified by the design requirements. The mixed fault types of the HFM
provide dynamic fault tolerance, with linearly increasing system reliability as a function of increased
system size. An additional advantage of the HFM is the use of fault segregation to justify design
decisions. For example, to enhance the Byzantine fault coverage from 1 to 2 faults, the number of
system nodes must be increased from 4 to 7. Under the Byzantine model, a system of 5 or 6 nodes
actually yields lower reliability than even a 4 node system (see Table 6, Scenario $C_A$). However, as
we showed in [1] and [11], both the reliability and the system resilience to faults other than
Byzantine increase when the HFM is used, justifying the use of additional resources without increasing
algorithm complexity (see Table 6, Strategy 2, Scenario $H_A$).
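The drop in reliability between 4 and 6 nodes under the Byzantine model follows from simple counting. The estimate below is a leading-term approximation introduced here for illustration: it assumes a node failure probability $q \approx 10^{-4}$ and keeps only the smallest number of faults that defeats the system, $f + 1$ with $f = \lfloor (N-1)/3 \rfloor$.

```latex
% Leading-term estimate (assumption): the system fails once f + 1 nodes are faulty.
\[
  U_{C_A}(N) \approx \binom{N}{f+1} q^{\,f+1}, \qquad f = \left\lfloor \frac{N-1}{3} \right\rfloor
\]
% f = 1 for N = 4, 5, 6: more nodes only add more ways to collect two faults.
\[
  U_{C_A}(4) \approx 6 \times 10^{-8}, \quad
  U_{C_A}(5) \approx 10 \times 10^{-8} = 1.0 \times 10^{-7}, \quad
  U_{C_A}(6) \approx 15 \times 10^{-8} = 1.5 \times 10^{-7}
\]
% f = 2 first becomes available at N = 7.
\[
  U_{C_A}(7) \approx 35 \times 10^{-12} = 3.5 \times 10^{-11}
\]
```

The values for 4, 6, and 7 nodes match the $C_A$ entries of Table 6, and the pattern shows why nodes added without raising $f$ only increase exposure under a single fault type model.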

To demonstrate the use of the HFM in making design decisions, we will examine several sets of
design requirements, demonstrating why certain strategies and scenarios are most likely to achieve
the design goals. Tradeoffs among the system cost, reliability, fault resilience, and system
requirements are considered in defining candidate designs.

7.1  Example  1 

Our  first  application  requires  a  system  of  at  most  five  nodes.  All  good  nodes  are  required  to 
compute  a  correct  value  and  to  flag  the  presence  of  some  faulty  nodes.  The  target  unreliability  is 
on  the  order  of  10“^.  The  probability  of  arbitrary  faults  is  assumed  to  be  negligible,  and  two  faults 
must  be  tolerated. 

We first examine the scenarios under Strategy 1 (§6.1). We can immediately reject $C_B$ because
good nodes are not guaranteed to compute a correct value in the presence of symmetric faults.
While correct values will be computed in scenarios $C_A$ and $C_S$ and a four node system satisfies the
reliability requirements, neither scenario detects the presence of even one faulty node. We therefore
reject Strategy 1, and examine the strategies under the HFM.

Using  Strategy  2  (§6.2),  both  scenarios  Hs  and  Ha  satisfy  the  correctness  and  unreliability 
requirements.  Since  hybrid  algorithms  are  used,  node  faults  that  cause  a  message  to  fail  the 
framing  check  will  be  detected.  Since  the  probability  of  asymmetric  faults  is  negligible,  adopting 
the  Ha  scenario,  with  its  interactive  consistency  algorithm,  may  not  be  cost  effective.  So,  Hs  with 
four  nodes  would  appear  to  be  adequate  for  this  example,  with  its  unreliability  of  2.4E-11  from 
Table  6. 

However, the probability of an asymmetric fault, while negligible, is still non-zero. Thus, the
reliability computed for $H_S$ under Strategy 2, assuming perfect fault coverage, may be overly
optimistic. The unreliability for this scenario under Strategy 3 (§6.3), with the probability of
uncovered asymmetric malicious faults assumed to be .005, is given in Table 6 as 2.0E-6. Thus,
the reliability requirement is no longer satisfied. Furthermore, the system cannot detect potential
corruption of the final value by an asymmetric malicious fault.

Although many system designers would still adopt $H_S$ under Strategy 2, the HFM provides
another alternative by permitting a further decrease in the probability of undetectable corruption
due to an asymmetric malicious fault. In Strategy 4 (§6.4), additional active redundancy techniques
are implemented in the system. Messages are encoded and decoded with an ECC, sanity checks
are applied to received data values, and the voted result is compared to all original values to detect
the presence of an otherwise undetected fault. In this way, some of the asymmetric faults that were
malicious under previous strategies are transformed into non-malicious faults under Strategy 4.

For  example,  the  framing  check  imder  Strategy  3  would  not  have  detected  a  value  out  of  range 
in  an  otherwise  correctly  framed  message.  Unless  all  good  nodes  received  the  same  incorrect  value 
from  that  faulty  sender,  the  sender  has  committed  a  (malicious  asymmetric)  fault  which  could  cause 
the  system  to  fail  under  Strategy  3.  By  implementing  a  sanity  check  to  ensure  that  all  values  to  be 
voted  are  within  the  correct  range,  Strategy  4  can  mask  faults  of  this  nature.  Once  the  final  value 
is  computed,  the  comparison  of  received  values  with  the  final  value  then  permits  a  potential  fault 
to be flagged. As shown in Table 6, the unreliability of this system under scenario Hs in Strategy 4,
without assuming perfect fault coverage, is estimated to be 8.0 E-7, which is in the desired range.
If perfect fault coverage is assumed, the estimated unreliability is 6.1 E-12.
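
The Strategy 4 receive path described above (sanity check, vote, then comparison of each original value with the voted result) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `strategy4_vote`, the use of a median voter, the `tolerance` parameter, and the convention that a message rejected by the ECC or framing check arrives as `None` are all assumptions made for the example.

```python
from statistics import median


def strategy4_vote(received, low, high, tolerance=0.0):
    """Vote on received values and flag suspect senders.

    received  -- dict mapping sender id to value, or None if the message
                 was rejected earlier (assumed ECC/framing failure)
    low, high -- inclusive sanity-check range for a valid value
    tolerance -- assumed allowed deviation from the voted result
    Returns (voted_value, set_of_flagged_senders).
    """
    # Sanity check: only in-range values take part in the vote.
    usable = {s: v for s, v in received.items()
              if v is not None and low <= v <= high}
    flagged = set(received) - set(usable)
    if not usable:
        raise ValueError("no usable values to vote on")

    # Passive redundancy: mask remaining bad values with a median vote.
    voted = median(usable.values())

    # Post-vote comparison: flag senders whose value disagrees with the result.
    flagged |= {s for s, v in usable.items() if abs(v - voted) > tolerance}
    return voted, flagged
```

For instance, `strategy4_vote({"n1": 10.0, "n2": 10.0, "n3": 10.0, "n4": 999.0}, 0.0, 100.0)` drops the out-of-range value before voting and reports `n4` as suspect.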

Based  on  our  analysis,  there  are  several  strategies  which  can  be  used  in  solving  our  example 
problem, permitting other factors to be addressed in choosing the final implementation. For
example, scenario could also be applied if the extra round of rebroadcast could be justified for some
other reason, such as the elimination of a single point of failure. Another point of consideration
is the requirement that two faults be tolerated. While the scenario Hs strategies we judged to be
acceptable were all capable of tolerating two faults, no four-node implementation could tolerate two
symmetric malicious faults. The ambiguity lies not in our model, but in the problem specification.
A more precise definition of the types of faults to be tolerated could change the acceptable scenario
and strategies appreciably, as is evident in the next example.

7.2  Example  2 

We  now  increase  our  reliability  requirements  and  more  precisely  specify  the  faults  to  be  tolerated. 
All  good  nodes  are  required  to  compute  a  correct  value  and  to  flag  the  presence  of  some  faulty 
nodes. The target unreliability is now on the order of 10^-10. Up to eight nodes are permitted, and
a minimum of two faults, one of which is arbitrary, must be tolerated. Other system considerations
require  that  the  weight  and  cost  be  minimized. 

We  first  examine  the  classical  single  fault-type  models  of  Strategy  1.  The  requirement  that  the 
system  tolerate  one  arbitrary  fault  immediately  removes  scenarios  Cg  and  Cs  from  consideration. 
Since  the  system  must  tolerate  two  faults,  scenario  requires  an  8  node  system,  with  the  reliability 
given  in  Table  6  as  5.6E-11.  To  permit  scenario  to  detect  some  faults,  a  comparison  function 
can  be  implemented  to  permit  a  node  to  detect  differences  between  the  final  value  and  the  original 
value  sent  by  each  node.  With  this  modification,  scenario  Ca  satisfies  the  requirements  of  this 
example. 

However,  if  another  strategy  and  scenario  can  be  found  which  satisfy  the  same  requirements  with 
fewer  system  nodes,  then  the  need  to  minimize  cost  and  weight  would  make  that  scenario  a  more 
practical candidate. An examination of the properties of scenario under either Strategy 3 (§6.3)
or Strategy 4 (§6.4), given in Tables 4, 5, and 6, shows that the six node configuration satisfies the
problem  requirements.  The  choice  of  strategy  could  then  be  made  based  on  the  cost  of  implementing 
the  additional  active  redundancy  techniques  used  in  Strategy  4. 

8 Conclusion


In this paper we have presented an overview of both the classical single fault type models and the
mixed  fault  type  HFM.  Using  the  hybrid  fault  taxonomy,  we  compared  the  fault  tolerance  of  both 
classes  of  models.  After  providing  closed  form  reliability  expressions,  we  examined  the  reliability, 
fault  resiliency  and  other  system  characteristics  for  a  variety  of  system  design  strategies.  We  then 
applied  the  strategies  to  two  examples,  demonstrating  the  impact  of  the  hybrid  fault  model  upon 
the  decision  process  in  designing  systems  under  a  variety  of  constraints. 

The  major  problem  in  applying  this  type  of  technique  to  system  design  or  analysis  is  the  need 
to  estimate  node  failure  rates  and  the  conditional  probabilities  of  mixed  fault  types.  Thus,  the 
comparisons  provided  in  discussing  the  different  system  strategies  should  be  applied  according  to 
the  relative  probabilities  of  various  fault  types,  as  few,  if  any,  adequate  techniques  or  data  currently 
exist  to  provide  precise  estimates  of  the  occurrences  of  mixed  fault  types.  Another  difficulty  is 
ambiguous  or  incomplete  system  requirements,  which  do  not  provide  an  accurate  representation  of 
the  desired  system.  This  too  represents  an  ongoing  research  problem  that  attempts  to  transcend 
the  limitations  of  natural  language  in  specifying  systems. 

We are currently extending our work to address dynamic redundancy management, requiring
on-line fault detection, diagnosis, isolation, and system reconfiguration. Several facets of the
diagnostic process which were not relevant under the static redundancy assumption need to be examined
carefully. The detection mechanisms assumed in many of the strategies presented above will need
to be expanded to lessen the ambiguity inherent in distributed decision making. Also, minimizing
the diagnosis process bandwidth becomes more important than minimizing the
amount of information exchange required to mask malicious faults.

The examples in this paper were kept simple to prevent the difficulties encountered in designing
large,  complex  systems  from  hiding  the  decision  processes  supported  by  the  HFM.  However,  many  of 
the  fault  tolerance  techniques  implemented  in  the  Multicomputer  Architecture  for  Fault  Tolerance 
(MAFT)  system  [17,  18]  were  chosen  based  on  considerations  similar  to  those  we  described.  Our 
current  work  includes  the  application  of  the  HFM  to  more  realistic  design  problems. 


References 

[1] M. M. Hugue, “The hybrid fault and reliability model for distributed systems,” ONR contract
technical report (submitted to rds92), ATC ASAC, Mar 1992.

[2] N. Suri, M. M. Hugue, and C. Walter, “Reliability modeling of large fault-tolerant systems,”
in Proceedings, 22nd Annual International Symposium on Fault Tolerant Computing, p. (to
appear), IEEE Computer Society, 1992.

[3] M. M. Hugue and N. Suri, “Approximate agreement and the hybrid fault model,” ONR contract
technical report (submitted to rtss92), ATC ASAC, Mar 1992.

[4] B. W. Johnson, Design and Analysis of Fault-Tolerant Digital Systems. Addison-Wesley
Publishing Company, 1989.

[5] J.-C. Laprie, “Dependability: Basic concepts and associated terminology.” February 1990.




[6] M. M. Hugue, “Fault type enumeration and classification,” ONR Technical Report ONR TR
91-05, Allied-Signal Aerospace Company ATC, July 1991.

[7] K. S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science
Applications. Englewood Cliffs, N.J. 07632: Prentice Hall, first ed., 1982.

[8] R. Geist and K. Trivedi, “Reliability estimation of fault-tolerant systems: Tools and
techniques,” Computer, vol. 23, pp. 52-61, July 1990.

[9] P. Thambidurai and Y.-K. Park, “Interactive consistency with multiple failure modes,” in
Proceedings, Seventh Symposium on Reliable Distributed Systems, pp. 93-100, IEEE, October
1988.

[10] F. Meyer and D. Pradhan, “Consensus with dual failure modes,” in Proceedings, Seventeenth
International Symposium on Fault Tolerant Computing, pp. 48-54, IEEE, July 1987.

[11] P. Thambidurai, Y.-K. Park, and K. Trivedi, “On reliability modeling of fault-tolerant
distributed systems,” in Proceedings, 9th International Conference on Distributed Computing
Systems, June 1989.

[12] M. Pease, R. Shostak, and L. Lamport, “Reaching agreement in the presence of faults,” JACM,
vol. 27, pp. 228-234, April 1980.

[13] D. Dolev et al., “An efficient algorithm for Byzantine agreement without authentication,”
Journal of Information and Control, vol. 52, pp. 257-274, 1982.

[14]  D.  Dolev,  N.  Lynch,  S.  Pinter,  E.  Stark,  and  W.  Weihl,  “Reaching  approximate  agreement  in 
the  presence  of  faults,”  in  Proceedings,  Third  Symposium  on  Reliability  in  Distributed  Systems, 
pp.  145-154,  October  1983. 

[15]  H.  R.  Strong  and  D.  Dolev,  “Byzantine  agreement,”  in  Proceedings,  1983  IEEE  Compcon, 
pp.  77-81,  IEEE  Computer  Society,  1983. 

[16]  L.  Lamport,  R.  Shostak,  and  M.  Pease,  “The  Byzantine  generals  problem,”  ACM  Transactions 
on  Programming  Languages  and  Systems,  vol.  4,  pp.  382-401,  July  1982. 

[17] C. Walter, R. Kieckhafer, and A. Finn, “MAFT: A multicomputer architecture for
fault-tolerance in real-time control systems,” in Proceedings of the IEEE Real-Time Systems
Symposium, (Washington, DC), pp. 133-140, IEEE Computer Society, IEEE Computer Society
Press, December 1985.

[18] R. Kieckhafer, C. Walter, A. Finn, and P. Thambidurai, “The MAFT architecture for
distributed fault tolerance,” IEEE Transactions on Computers, vol. C-37, pp. 398-405, April
1988.




Design  Capture  for  System  Dependability 


Jeffrey  Zhou 

Allied-Signal  Aerospace  Technology  Center 
9140 Old Annapolis Road/MD 108
Columbia,  MD  21045 
zhou@batc.allied.com 


Abstract 

It is essential to develop a set of system views which faithfully and
completely specify complex computing systems from all important
aspects, including such non-functional attributes as system dependability
and performance. In this paper, we describe a dependability view
which can be employed as a useful design tool to specify and analyze
dependable systems. A real-time fault-tolerant operating system
design is presented as a real-life case of using the dependability view.


1  Introduction 

The development of a set of fundamental system views to capture all facets of
a  complex  system  design  is  essential  in  completely  specifying  such  systems. 
Each  system  view  should  present  an  important  aspect  of  the  system  and  a 
complete  set  of  views  is  needed  to  faithfully  specify  the  system. 

Five system views are commonly used in computer based system engineering
(CBSE): Informational, Functional, Behavioral, Environmental,
and Implementational [1]. The first three views are often used for system
specification and design, while the last two views provide implementation
constraints. The Informational View describes the system components and
information flow among those components. Both system partitions and object
relations are shown. The Functional View specifies the system functions
and their inputs and outputs, describing the system operations and its
response to stimuli. The Behavioral View describes different system states
and their transitions, characterizing the dynamic system behavior. Each of
these three views captures an important system aspect that is not covered
by the other two views. Since all three views portray the same system,
relations among them can be exploited to ensure that the specifications are
consistent. A mapping matrix can be used to establish the relation between
the Informational View and the Functional View. Similarly, the state
transitions in the Behavioral View can be traced in the Functional View by
executing a series of functions. The current system development technology
uses these three views, more or less independently, to capture system
constructs. A number of computer-aided system engineering (CASE) tools
have also been built based on these views. Nevertheless, these three system
views, combined with the Environmental and Implementational views, are
not adequate to fully specify the properties of complex systems designed for
distributed, real-time, and mission-critical applications.

In developing the Real-Time Executive Module (RTEM), an operating
system kernel for real-time and fault-tolerant computing, we found that two
essential views, the system dependability and system performance (real-time)
views, should be added to the current set. In this paper, we concentrate
on the dependability design of the RTEM system. RTEM is a complex
software operating system kernel which controls a distributed computer
platform for real-time and mission-critical applications. In its original design
document, all three available views were used to describe the system
architecture, functions, and behaviors in high-level graphical diagrams. However,
the system dependability had only textual descriptions and its specifications
were embedded in the specifications of system functions used to implement
redundancy management. Such textual descriptions are subject to different
interpretations and usually present too much detail in the conceptual design
stage. The mixed specifications also made the system dependability analysis
less abstract and more implementation dependent. To overcome these
shortcomings, a system dependability view needs to be defined independently in
high-level system design to capture system dependability constructs.

In this paper, we discuss a method which is used in the RTEM
development to construct a graphical representation for system dependability.
We use two symbols, Fault Containment Region (FCR) and Redundancy
Management Technique (RMT), as basic building blocks for constructing a
system dependability view. A system can be partitioned hierarchically into
multiple FCR’s and each FCR contains some RMT’s to define redundancy
management techniques used in its region. We suggest that FCR partitioning
and RMT selection be based on a hybrid fault model and its associated
redundancy management techniques [2][3]. Although the dependability view
proposed in this paper is not a formal one, it provides a framework for future
formalization.


2 Terminology

In this section, we introduce the terminology used in describing the
dependability system view and the hybrid fault model. By dependability, we mean
the qualitative property of a system that permits justifiable reliance on the
services delivered [8, 6]. We use the standard definition of reliability as the
conditional probability that a system is operating correctly throughout an
interval of time, given that it was operating correctly at the beginning of
that interval. The number of faults that the system can tolerate without
becoming undependable is its resiliency.

Resource duplication, or redundancy, is used at the information, component,
or computation levels to ensure fault tolerance, that is, correct operation
in the presence of faults. Active redundancy techniques attempt to achieve
fault tolerance through fault detection, alone or in conjunction with location
and recovery. Passive redundancy techniques use fault masking to hide
the occurrences of faults and to eliminate the effects of faults, thus avoiding
errors. Non-iterative passive redundancy techniques require a single round
of message exchange. Iterative fault masking techniques, such as interactive
convergence and interactive consistency, require additional rounds or iterations
of message exchange among participants [5]. Fault-tolerant voting
techniques, such as majority and median, are non-iterative passive redundancy
techniques on which iterative passive redundancy techniques are often
based. Hybrid redundancy techniques combine active redundancy methods
that detect faults with passive methods that mask the remaining faults.
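
To make the two voting techniques named above concrete, the sketch below shows one plausible form of a majority voter and a median voter operating on the values collected in a single round of message exchange. The function names and the choice to return `None` when no strict majority exists are assumptions for illustration; the paper does not prescribe them.

```python
from collections import Counter
from statistics import median_low


def majority_vote(values):
    """Return the value held by a strict majority, or None if there is none."""
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else None


def median_vote(values):
    """Return the (lower) median, which a minority of outliers cannot move
    outside the range spanned by the correct values."""
    return median_low(values)
```

With three copies of a value, either voter masks a single erroneous copy: `majority_vote([7, 7, 9])` and `median_vote([7, 7, 1000])` both select 7.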

Redundancy management is used to administer the implemented
redundancy techniques. If static redundancy management is used, faulty nodes
or components remain in the system; neither fault isolation nor reconfiguration
is considered. If dynamic redundancy management is employed, faulty
nodes can be isolated and repair, recovery, or reconfiguration can be
attempted.

Unlike previous work, we place no limitations on the duration of a fault;
transient, intermittent and permanent faults are supported. The hybrid fault
model classifies faults according to the errors they cause and the techniques
needed to tolerate those errors, based on the following definitions. The
scope of a fault refers to the portion of the system affected by that fault,
also called the fault extent. A symmetric fault generates errors that are
manifested identically throughout the fault scope. An asymmetric fault
generates errors that are manifested differently throughout the fault scope.
Asymmetric faults are potentially more severe than symmetric faults.

Non-malicious faults can and will be detected in a non-faulty node by the
active redundancy techniques implemented in that node. Malicious faults
are those faults that cannot be detected by the implemented active redundancy
techniques, but require masking using passive redundancy techniques.

Combining the attributes of malice and symmetry produces the three
mutually exclusive and collectively exhaustive fault sets that make up the
hybrid model: non-malicious faults (B), malicious symmetric faults (S), and
malicious asymmetric faults (A). The worst-case, or most severe, faults
are those in A. Faults in S are less severe than the faults in A, but
are more severe than faults in B. The system dependability view presented
below assumes the hybrid fault model is used in specifying the system’s fault
handling capacity.
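
The classification above reduces to two questions about a fault: is it detected by the active redundancy techniques in a non-faulty node, and if not, does it manifest identically throughout its scope? The following sketch encodes that decision and the stated severity ordering (B below S below A). The names `FaultClass`, `classify`, and `worst_case` are hypothetical and introduced only for this illustration.

```python
from enum import Enum


class FaultClass(Enum):
    B = "non-malicious"
    S = "malicious symmetric"
    A = "malicious asymmetric"


# Severity ordering stated in the text: B < S < A.
SEVERITY = {FaultClass.B: 0, FaultClass.S: 1, FaultClass.A: 2}


def classify(locally_detectable, symmetric):
    """Map the two fault attributes onto the hybrid model's three sets."""
    if locally_detectable:
        # Detected by active redundancy, so non-malicious whatever its symmetry.
        return FaultClass.B
    return FaultClass.S if symmetric else FaultClass.A


def worst_case(faults):
    """Return the most severe fault class in a non-empty collection."""
    return max(faults, key=SEVERITY.__getitem__)
```

Because a detectable fault falls in B regardless of symmetry, the three sets stay mutually exclusive and cover every combination of the two attributes.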

3  System  Dependability  View 

The system dependability view used in the RTEM development is a graphical
notation which employs two symbols, as well as their relations, to describe
system dependability. The two symbols are Fault Containment Region (FCR)
and Redundancy Management Technique (RMT). An FCR is defined as a region
beyond which a certain number and type of fault cannot propagate. An RMT
is the technique used to fulfill the FCR objective. A system dependability
design can then be described by using these symbols, either in a top-down
or a bottom-up fashion.

If a top-down design method is adopted, a candidate system can be
partitioned into several FCR’s with coverage for the different fault types
of the hybrid fault model. For example, a system may require its
data receiving subsystem to detect only non-malicious faults. Thus, this
subsystem can be treated as one FCR. On the other hand, a central control
subsystem executing the control logic for weapon launch may have to tolerate
any type of fault including Byzantine faults (5). This requirement defines
another FCR with much stronger fault resiliency in its region. An FCR
usually has a territory bound by natural hardware/software components.
It also indicates its fault-tolerant capability with the number and types of
faults tolerated in its region.


Figure 1: A simple dependability view


A system may have several top-level FCR’s and each FCR can be
unfolded to show a hierarchical structure. Figure 1 shows a simple
dependability view with one layer of FCR hierarchy. The top-level FCR has three
sub-level FCR’s, each of which can tolerate only non-malicious faults. In
other words, if a malicious fault exists, for instance in FCR1, it will not
be detected/corrected and may manifest errors in the FCR1 outputs to the
FCR region. If the top-level FCR is required to stop the fault propagation,
or it cannot afford errors caused by the fault in its outputs, an RMT must
be used to tolerate the fault. In figure 1, the RMT is a distributed
majority voting algorithm, and the top-level FCR will be able to tolerate one
malicious symmetric fault and multiple non-malicious faults.
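
The hierarchy just described can also be captured as a small data structure, which may help when a dependability view has to be checked or processed by a tool. The sketch below is one possible encoding and not part of the RTEM notation: the class name `FCR`, its fields, the sub-region names FCR2 and FCR3, and the use of `None` for a tolerated-fault count that the text does not state are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class FCR:
    """A fault containment region with its RMTs and sub-level regions."""
    name: str
    tolerates: dict                       # fault class -> count, None if unstated
    rmts: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def outline(self, depth=0):
        """Return the hierarchy as a list of indented text lines."""
        pad = "  " * depth
        caps = ", ".join(
            f"{cls}:{'unspecified' if n is None else n}"
            for cls, n in sorted(self.tolerates.items())
        )
        lines = [f"{pad}{self.name} tolerates [{caps}]"]
        lines += [f"{pad}  RMT: {rmt}" for rmt in self.rmts]
        for child in self.children:
            lines += child.outline(depth + 1)
        return lines


def figure1_view():
    """Encode the one-layer view described for Figure 1 (names assumed)."""
    subs = [FCR(f"FCR{i}", {"B": None}) for i in (1, 2, 3)]
    return FCR("FCR", {"S": 1, "B": None},
               rmts=["distributed majority voting"], children=subs)
```

Printing `"\n".join(figure1_view().outline())` lists the top-level region, its RMT, and the three sub-level regions that tolerate only non-malicious faults.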

It is interesting to note that the RMT implementation will affect FCR
capability. For example, if the RMT in figure 1 is changed to the Triple
Modular Redundancy (TMR) technique, a single voter is introduced into the
original FCR. The voter becomes a new fault containment region, FCR4, as
shown in figure 2. That changes the overall FCR resiliency. The FCR can no
longer constrain the propagation of a malicious fault if it originates in FCR4
and is linked to the FCR outputs. To preserve the original FCR dependability,
FCR4 itself must be able to cover at least one malicious symmetric fault, as
shown in figure 3. FCR4 then should have at least three voters and
multiple links which perform cross voting. Note that an FCR may
change its dependability while still maintaining its reliability. For instance,
although the FCR in figure 2 changes its dependability, it may still be able
to maintain its reliability if a highly reliable component is used for the single
voter. While expressing system dependability explicitly, the Dependability
View also provides a means for system reliability analysis [2].

Figure 2: Dependability view with TMR

Figure 3: Multiple voters in a separate FCR


4  Real-Time  Executive  Module 

In this section, we present a real-life system design case to show how the
Dependability View can be used to enhance design capability for fault-tolerant
computer systems. RTEM is a software operating system kernel designed to
control a distributed computing platform for real-time and mission-critical
applications. The goal of our research is to develop a set of system executive
functions which are portable to microprocessor based computing systems.
These executive functions can be loaded to each processor in a distributed
computing environment and perform both time management and redundancy
management for a single node, as well as for the complex system.
In the first phase of research, we are developing a proof-of-concept prototype
which will be used as a flexible system model to test and verify design
concepts for the Multicomputer Architecture for Fault-Tolerance (MAFT)
[9]. In this paper, we concentrate on the discussion of system dependability
design.

The dependability requirement for the RTEM prototype is to design a
system that is able to tolerate a single fault of any type: non-malicious,
malicious symmetric, or malicious asymmetric. In order to cover a single
malicious asymmetric fault, the prototype needs a minimum four node
configuration and a fully connected communication network to perform the
Interactive Consistency Algorithm [4, 5]. Figure 4 contains the Informational
View of the system, which shows object partitioning and the information
flow among components. Figure 5 illustrates the Functional View,
in which seven main functions define the system operation. These functions
are loaded into each node in the distributed platform, and provide the core
processes of time and redundancy management. If a faulty node is detected,
dynamic redundancy management is required to reconfigure the system.

Figure 4: RTEM informational view

Figure 5: RTEM functional view

The system has three main operation states: cold-start, steady-state,
and reconfiguration. The operation state transition diagram is shown in
figure 6. When the system is powered up, it is in the cold-start state. During
this period, all nodes try to synchronize with each other to form a common
operating set. The Interactive Consistency Algorithm is computed so that
all nodes can have a consistent common view about their membership in the
operating set. Once the operating set is formed, it triggers the transition
from cold-start to the steady-state operation, which is the main operation
state for the RTEM system. The transition from steady-state to reconfiguration
mode is triggered if a faulty node is detected by the system.

Figure 6: RTEM behavioral view
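
The three operation states and the two transitions named in the text can be written down as a small transition table. The sketch below is illustrative only: the event names `operating_set_formed` and `faulty_node_detected` are invented labels for the triggers described above, and no exit from the reconfiguration state is modeled because the text does not describe one.

```python
from enum import Enum, auto


class State(Enum):
    COLD_START = auto()
    STEADY_STATE = auto()
    RECONFIGURATION = auto()


# Only the transitions stated in the text are listed here.
TRANSITIONS = {
    (State.COLD_START, "operating_set_formed"): State.STEADY_STATE,
    (State.STEADY_STATE, "faulty_node_detected"): State.RECONFIGURATION,
}


def step(state, event):
    """Return the next state, staying in place for events with no transition."""
    return TRANSITIONS.get((state, event), state)
```

Starting from `State.COLD_START`, the event sequence `operating_set_formed`, `faulty_node_detected` drives the system through steady-state into reconfiguration.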

Constructing a dependability view is an attempt to establish a common
language as a design aid for system engineers to capture, describe, and
analyze fault-tolerant systems. The RTEM dependability view defines fault
containment regions based on system partitioning and a fault containment
architecture based on the system hierarchy. The dependability view also
states the redundancy management techniques that are used, and how they
are applied to tolerate faults in a particular region. The graphical
representation provides a tool for system engineers to use common symbols and
syntax to discuss system dependability design.

An FCR usually has a territory bound by natural hardware/software
components. The entire RTEM system can be considered as one FCR which
has the capability of tolerating a single fault of any type. In the FCR,
there are several sub-level FCR’s which provide the necessary hardware/software
redundancy to realize the fault-tolerant capability of the top-level FCR. The
partitioning is based on hardware components and each processor can be
naturally defined as a fault containment region. Since all four nodes are
fully connected, the communication network can also be divided into four
broadcasting channels and each of them can be included in its processor
based fault containment region. Therefore, no separate sub-level FCR is
used to represent the network component. In the following discussion, we
will use FCR to refer to the system level fault containment region. Component
level fault containment regions will be referred to as FCRi for an individual
node or FCRs for multiple nodes.

After partitioning, we need to determine the fault-tolerant capability for
each FCRi. There are a number of options, and the design choice must
be made to meet dependability requirements for both components and the
system. In RTEM, a node FCRi is not required to contain malicious faults.
In other words, if a malicious fault originates in an FCRi, it may appear in
the FCRi outputs. Nevertheless, it will not propagate through the system FCR
region because of the redundancy techniques used among FCRs. Under
this requirement, each FCRi has a single processor executing only active
redundancy techniques for fault detection. Each FCRi includes as many as
25 different error detection mechanisms, so a wide range of faults can be
detected. A detected fault is then isolated and masked in the FCR region.

By the definition of fault containment region, FCRi errors can be monitored
only through message exchanges with other FCRs. A message can be
a system state vector, task scheduling vector, application data values, etc.
There are a number of messages flowing around the system. If a node
becomes faulty, it may generate erroneous messages and broadcast them as the
output of its FCRi. In order to fulfill the dependability objective of the
system FCR, appropriate redundancy management techniques (RMTs) should
be applied to detect and correct those erroneous messages. The faulty node
should be properly identified and then gracefully excluded from the system.

Figure 7: RTEM dependability view

After defining FCRs and RMTs, we can construct the dependability view
for the RTEM system as illustrated in figure 7. The view can then be used
to discuss system dependability design. The key issue is to
select adequate redundancy techniques for the system to tolerate a malicious
asymmetric fault. If the system can cover a malicious asymmetric fault, it
can cover a single fault of other types too. As shown in the dependability
view, three different RMTs are used in RTEM. RMT1 computes the
Interactive Consistency Algorithm for certain system messages so that all nodes
will have a consistent common view of five important system data
structures. They are: system operating set, task complete/start vector, system
state vector, error vector, and penalty count for individual nodes. If a
malicious asymmetric fault causes errors in these messages, it will be properly
masked by all good nodes in exactly the same way. Maintaining consistency
is a necessary condition to achieve the RTEM design goal: a good node
will never be commonly accused by a majority of nodes and a bad node will be
commonly identified by all good nodes.

In the RTEM dependability view, system synchronization and application
data are the only two messages that are not covered by the RMT1 technique.
Instead, they use majority voting algorithms to protect data integrity.
Since simple majority voting cannot detect or mask malicious asymmetric
faults, the system may not be able to maintain the desired resiliency if such
faults manifest errors in these two messages. Computing the Interactive
Consistency check for all messages would, of course, be a safe design choice.
However, the Interactive Consistency Algorithm is very expensive in terms
of network bandwidth. To cover a single asymmetric fault, the system
needs two rounds of broadcasting and voting [5]. The first round broadcasts
N copies of a message, and the second round rebroadcasts copies of the
message. Then, the system is able to reach a consistent view of the message
in N nodes by voting on the copies. This conservative design option
is unacceptable for the RTEM system because the large amount of application
data would result in very poor system performance. A design trade-off has to
be made between system dependability and performance.
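
The two-round exchange described above can be sketched as follows. This is a minimal Python illustration, not the RTEM implementation: the function names, the "DEFAULT" fallback value, and the assumption that only the sender is faulty (all relays are honest) are ours.

```python
from collections import Counter

def majority(values, default="DEFAULT"):
    """Return the value held by a strict majority of copies, else `default`."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else default

def two_round_agreement(round1):
    """Simulate one message passing through two rounds.

    round1 maps each receiver id to the copy the sender delivered to it
    (a faulty sender may deliver different copies to different receivers).
    In round 2 every receiver relays its copy to all other receivers;
    relays are assumed fault-free in this sketch.
    """
    decisions = {}
    for node in round1:
        own_copy = round1[node]
        relayed = [copy for other, copy in round1.items() if other != node]
        decisions[node] = majority([own_copy] + relayed)
    return decisions

# An asymmetric sender delivers 10 to nodes 1 and 3 but 20 to node 2.
decisions = two_round_agreement({1: 10, 2: 20, 3: 10})
print(decisions)  # every receiver reaches the same decision
```

Because every good receiver ends up voting on the same collection of copies, all of them reach the same decision even though the sender behaved asymmetrically. The price is that every receiver must relay to every other receiver in the second round, which is the bandwidth cost the text refers to.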

Let us consider an erroneous data value that is generated by one node
and broadcast differently to other nodes, which also receive good copies of
the data from healthy nodes. If the erroneous data value has an acceptable
deviance from the good copies, it will not be excluded from voting. After
voting, the good nodes may hold different values for the same data message.
That can be a malicious asymmetric fault. However, if homogeneous
hardware and software are used for all nodes, the computed results are
identical across good nodes. Under this assumption, simple majority voting
is adequate to detect and mask malicious faults, whether they are symmetric
or asymmetric. The shortcoming of homogeneous hardware and software
is that the system is vulnerable to a generic fault. N-version designs and
implementations are commonly used to cover generic faults. Nevertheless,
N-version design introduces deviance too. Since the goal of RTEM is to
develop a set of system executive functions for a wide range of applications,
we design the system in such a way that it supports both homogeneous and
N-version applications. We leave the choice to application designers, who
should decide what type of fault-tolerant system they want to have. A similar
idea is applicable to synchronization. A detailed discussion is beyond the
scope of this paper and can be found in reference [10].
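
The effect of deviance on simple voting can be made concrete with a small sketch. The voter below (drop outliers, then select the lower median) and the tolerance value are hypothetical choices made for illustration; the paper does not specify the RTEM voter.

```python
import statistics

def vote(copies, tolerance):
    """Drop copies that deviate from the middle value by more than
    `tolerance`, then select the (lower) median of what remains."""
    centre = statistics.median_low(copies)
    accepted = [c for c in copies if abs(c - centre) <= tolerance]
    return statistics.median_low(accepted)

TOLERANCE = 5  # hypothetical acceptable deviance, in integer units

# N-version good nodes legitimately differ a little: 999, 1000, 1001.
# The faulty node sends 1004 to node A but 996 to node B.
node_a = vote([999, 1000, 1001, 1004], TOLERANCE)
node_b = vote([999, 1000, 1001, 996], TOLERANCE)

# Homogeneous good nodes all compute exactly 1000.
homog_a = vote([1000, 1000, 1000, 1004], TOLERANCE)
homog_b = vote([1000, 1000, 1000, 996], TOLERANCE)

print(node_a, node_b)    # the two good nodes disagree
print(homog_a, homog_b)  # the two good nodes agree
```

In the N-version case the faulty copies stay within the tolerance and shift the two voters in opposite directions, so good nodes end up holding different values. In the homogeneous case the identical good copies dominate the vote and the asymmetric copies are masked.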


5  Summary 

The  dependability  view  proposed  in  this  paper  is  an  attempt  at  constructing 
an  independent  system  view  in  order  to  capture  design  for  highly  dependable 
systems.  The  two  symbols,  FCR  and  RMT,  are  powerful  because  they 
are  closely  associated  with  components  and  functions  which  are  two  basic 
building  blocks  in  system  design. 

We are currently formalising the definition and properties of FCR for
dependable system design. The partitioning of systems into physically
independent segments has long been used to enhance system reliability. For an
ultra-reliable redundant system, such partitioning is crucial to the system's
dependability. Each independent segment is a natural FCR and is designed
to limit the physical damage of a fault within a region to that region [11], and
to localize the area in which fault recovery and repair are required [12]. The
system dependability view is based on an extension of the definition of FCR
to include the attributes necessary to define both physical and functional
partitioning of hardware, software, or functions, and to unify the concepts of
fault and error containment.

We are also investigating a taxonomy of RMTs based on the hybrid
model. The taxonomy will provide systematic guidance for system
engineers to select proper RMTs to meet system dependability requirements.
The formal FCR, combined with the RMT taxonomy, will establish a
foundation for constructing a formal system dependability view.


Abstract

Developing fault-tolerant real-time systems is complicated because they are often composed
of many special-purpose architecture components with non-standard interfaces. Furthermore,
programming languages for these systems typically complicate the expression and enforcement
of timing, concurrency, and fault-tolerance requirements. In this paper, we present the RTC
programming environment that we are developing for the IEEE POSIX.4 standard interface.
The RTC environment integrates explicit expression of timing and concurrency constraints and
provides mechanisms for programming time fault-tolerance. We show how the RTC run-time system
enforces these constraints using the IEEE POSIX.4 interface. This use of a standard interface
mitigates some of the complication present in designing fault-tolerant real-time applications.


1  Introduction 

In real-time applications such as submarine command control and avionics, there are both timing
constraints and shared resource consistency constraints that must be predictably met. Those
applications are often controlled by a distributed system to match the distributed topology of the
components and to provide better performance through concurrency. Furthermore, many of these
applications control delicate applications with severe consequences if these constraints are violated
and left untreated. Programming such distributed real-time systems to meet timing constraints
and consistency constraints, and to provide fault tolerance, is a complex undertaking.

There are several current practices that add to the complication of developing fault-tolerant
real-time systems. Most work in fault tolerance for real-time systems has used special-purpose
architectures¹ that incorporate techniques such as redundancy and voting [1, 2]. Using
special-purpose architectures makes the problem of programming real-time systems difficult because it is
usually not possible to take advantage of existing software tools. Also, software that is developed
for such architectures is not usually portable or re-usable. This phenomenon is depicted in Figure
1, where an extra layer of device-specific code is required to coordinate the various hardware
components to realize an application. Since this layer is special-purpose and tailored to both the
hardware and application, the system cannot be easily modified or ported.

¹In this paper we use the term architecture to refer to a system's underlying hardware and operating system.

Figure 1: Design of Complex Fault-Tolerant Real-Time Systems

Another problem adding to the complexity of systems involves the use of languages such as Ada [3]
for application programming. Ada requires that static priorities be assigned to tasks to “express”
timing constraints. Since timing constraints are not explicitly stated, but are hidden in the relative
priorities of tasks, constraints are difficult to write, verify and modify. Detecting and recovering
from constraint violations is also complicated by the constraints being hidden. Other difficulties
with Ada for real-time programming, such as its use of mutual exclusion with the possibility of
priority inversion, are mentioned in [4, 5].

To address these difficulties in designing complex fault-tolerant real-time systems, we have
designed the RTC software system that allows the explicit expression of timing, consistency, and
reliability constraints in the application program, and supports their enforcement while executing
on a standard interface to the architecture. The RTC programming language constructs support
concurrent real-time programming by combining an abstract data type paradigm with a
transaction-based paradigm while adding provisions for explicitly expressing timing and precedence constraints.
The constructs are designed to be embedded in a host language; our current implementation is in C,
but other host languages, such as Ada, can also be used. The RTC run-time system uses a special
form of locking of shared resources, including processors, to allow a priority-based scheduler in the
architecture to enforce the constraints expressed in the program without violating shared resource
consistency constraints. To achieve portability and software re-usability among many widely-used
architectures, we are implementing the RTC programming environment to execute “on top of”
architectures that adhere to the proposed IEEE POSIX 1003.4a standard for real-time operating
system interfaces. That is, the device-specific code is hidden under a standard POSIX interface, as
shown in Figure 1. This design allows all application software to assume a standard interface to the
architecture, thus allowing software development that is independent of the underlying architecture
and eliminating the need for device-specific code in the development of a complex application.

We use this RTC/POSIX interface to support fault tolerance in the application software instead
of, or in addition to, fault-tolerance techniques in the underlying architecture. The form of fault
tolerance that we address in this paper deals primarily with timing faults, which occur when an
application violates its timing constraints. Timing faults can occur for reasons such as a component
failure or a transient overload. Since meeting timing constraints is required for correct execution
in a real-time system, tolerance of timing faults involves ensuring that the system can achieve a
consistent state (i.e., a state that meets safety requirements) when a timing fault occurs. For
instance, if a submarine control program either misses its timing constraints or appears likely
to miss them, the control program should allow the system to recover without
disaster.

To  demonstrate  these  techniques,  we  will  use  a  submarine  application  called  MATE  (Manual 
Adaptive  TMA  (Target  Motion  Analysis)  Evaluator).  The  MATE  application  is  an  interactive 
display  application  that  is  used  to  compute  a  possible  range,  course  and  speed  of  a  sonar  contact. 
MATE consists of three major activities: the sonar input system, the MATE algorithm, and the
display. The sonar input system simulates the output of a submarine sonar system and produces
contact reports called Filtered Input Data Units (FIDUs) at a 20 second rate. The MATE algorithm
uses FIDUs and operator selected range, course, and speed settings to produce a data point on the
display. The processed data appears on the display as an unaligned vertical stack of dots. When the
operator enters a new range, course or speed solution for the contact, the FIDUs are reprocessed
by  the  MATE  algorithm  and  displayed.  When  the  dots  are  aligned  vertically,  the  operator  has 
found  a  possible  solution.  There  are  timing  constraints  on  the  input  and  generation  of  the  output 
that  must  be  met  for  correct  performance.  Concurrency  control  for  shared  resources,  such  as  the 
FIDUs,  must  also  be  provided.  We  show  how  the  RTC/POSIX  system  can  be  used  to  express  these 
constraints,  enforce  them,  and  handle  timing  faults. 

This paper is organized as follows. Section 2 presents the RTC language constructs and their use
in the MATE application. Section 3 briefly describes the proposed IEEE POSIX real-time standard,
and Section 4 describes how the RTC run-time system enforces the constraints expressed by the
constructs by using a POSIX compliant real-time architecture. Section 5 summarizes strengths and
weaknesses of our approach for supporting the development of complex real-time systems.


2  The  RTC  Programming  Environment

The RTC programming environment consists of a set of language constructs that express
real-time concurrency constraints in a host programming language, and a run-time system that uses an
underlying operating system to enforce these constraints. Our current implementation is embedded
in the C language.

2.1  RTC  Resources 

RTC resource constructs provide abstract views of shared system entities such as devices and data
structures. Each resource has private data structures and defines a set of actions that can be
invoked by processes to examine or change the resource's private data. In the MATE example,
there are several instances of resources, including the FIDU Record and the Display. The FIDU
Record requires concurrency control between the Sensor Processes, which write to parts of the FIDU
Record, and the MATE process, which reads from all of the FIDU Record. The Display resource
requires that actions on it be atomic so that no incomplete or jumbled displays are seen. An outline
of the RTC definition of a FIDU resource is shown in Figure 2. Each action specifies parameters
for exchanging information with its invoking process and a compatible declaration to indicate
permissible overlapping of the action's execution that will preserve the resource's state
consistency. For instance, in Figure 2, a FIDU.read action can overlap with other FIDU.read
actions. However, a FIDU.write cannot overlap with any other action. Note that the compatibility
that can be specified is more general than typical read/write compatibility, although read/write
is all that is illustrated in this example. The E_ABORT exception handler for actions shown in
Figure 2 is called by the run-time system when the process that calls the action aborts its call (as
discussed later).
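
A compatibility declaration of this kind can be pictured as a table that the run-time system consults before letting an action start. The sketch below is a hypothetical Python rendering (the table layout and the name `may_start` are ours, not part of RTC), covering only the read/write case illustrated in the text.

```python
# Hypothetical compatibility table for the FIDU resource: for each action,
# the set of actions whose execution may overlap with it.
FIDU_COMPATIBLE = {
    "read": {"read"},   # reads may overlap other reads
    "write": set(),     # a write may not overlap any action
}

def may_start(action, active_actions, table=FIDU_COMPATIBLE):
    """Return True if `action` may begin while `active_actions` are executing."""
    return all(
        running in table[action] and action in table[running]
        for running in active_actions
    )

print(may_start("read", ["read", "read"]))  # True
print(may_start("write", ["read"]))         # False
print(may_start("read", ["write"]))         # False
```

Because the table is data rather than a fixed reader/writer rule, richer compatibility relations than read/write can be expressed by adding entries.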


2.2  RTC  Processes 


RTC process constructs express how the application uses the resources. An example of the MATE
process expressed with RTC constructs is shown in Figure 3.

Action Invocations. A process may invoke actions of resources synchronously, which causes
the process to wait for the action invocation to return, or asynchronously, which causes the process to
continue executing. In the example of Figure 3, process MATE issues:

action Display.output(output_data)

to invoke the output action on the Display resource synchronously with parameter output_data. The
call:

action& (FIDU_read_done) FIDU.read (FIDU_info)

invokes action read on resource FIDU with parameter FIDU_info. This call is asynchronous, so
MATE does not wait for the action invocation to return before executing its next statement.


Events. In the asynchronous action call that we just discussed, FIDU_read_done is a variable of a
predefined RTC type called event. An event has an absolute time value that is established in one of
three ways: 1) a signal statement by a process or action, which causes the current absolute time to
be assigned to the event variable; 2) a clear statement by a process or action, which sets the event
variable to an infinite absolute time; or 3) the run-time system signaling an event associated with
the completion of an asynchronous action invocation (such as FIDU_read_done).
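
The event semantics above can be modelled with a few lines of Python. This is only a sketch of the described behavior under our own assumptions: the class name, the injectable `clock` parameter, and the `has_occurred` helper are hypothetical and are not RTC constructs.

```python
import math
import time

class Event:
    """An event holds an absolute time; infinity means 'not yet signalled'."""

    def __init__(self, clock=time.time):
        self._clock = clock       # source of the current absolute time
        self.time = math.inf      # a new event starts out cleared

    def signal(self):
        """Assign the current absolute time to the event."""
        self.time = self._clock()

    def clear(self):
        """Reset the event to an infinite absolute time."""
        self.time = math.inf

    def has_occurred(self):
        return self.time != math.inf

# Usage with a fixed clock so the result is reproducible.
done = Event(clock=lambda: 42.0)
done.signal()
print(done.time)            # 42.0
done.clear()
print(done.has_occurred())  # False
```

Representing a cleared event as an infinite time means that any timing expression built on it (for example a deadline of "by event") simply never expires until the event is signalled.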

Timing blocks. RTC timing blocks express earliest start times (after), latest start times
(before), deadlines (by), and periods (every) for a series of statements using timing expressions
involving event variables and relative times. A construct to express maximum execution time
(execute) is also provided. Exception handlers for violations of many of the constraints expressed by
these blocks can be expressed. There are several timing blocks in the example of Figure 3. One
timing block states: by last_update + 10sec; this block expresses a deadline by which the part of the
MATE process shown in Figure 3 must complete. Near the bottom of Figure 3 is the E_DEADLINE
exception handler that interrupts the constrained statements to execute if the deadline is violated.
Another example of a deadline is: by op_input, where op_input is a global event signaled by the
process monitoring the operator's track ball input. This deadline serves to interrupt the calculation
of the output data when new input data arrives; MATE then loops back and starts calculating the
output data again.

process MATE

  event last_update, FIDU_read_done, OP_read_done, op_input

  output_calculated = FALSE
  by last_update + 10sec do
    guaranteed
      while (!output_calculated) {
        before (last_update + 8sec) do
          by op_input do
            exclusive
              action& (FIDU_read_done) FIDU.read (FIDU_info)
              action& (OP_read_done) operator.read (op_info)
            end exclusive
            after max (FIDU_read_done, OP_read_done) do
              calculate output_data
              output_calculated = TRUE
          except /* by op_input */
            when E_DEADLINE /* op_input */ do clear op_input end when
          end do /* by op_input */
        except /* before last_update + 8sec */
          when E_START do
            output_data = quick update
            output_calculated = TRUE
            warn operator of possible MATE malfunction
          end when /* E_START */
        end do /* before last_update + 8sec */
      } /* end while */
      no_except
        action Display.output(output_data)
        signal (last_output)
      end no_except
    end guaranteed
  except /* by last_update + 10sec */
    when E_DEADLINE alert operator of MATE malfunction end when
  end do /* by last_update + 10sec */


Figure 3: Example of MATE Process with RTC Constructs

Action Invocation Precedence Ordering. Flexible expression of precedence constraints
(concurrency) within processes and between processes is supported using a combination of events, and
synchronous and asynchronous action invocations, along with earliest start time constraints. For
instance, concurrency within a process is expressed using asynchronous action invocations to "fork"
off concurrent action invocations such as FIDU.read and operator.read in Figure 3. Then, later in
the process, the earliest start time constraint:

after max (FIDU_read_done, OP_read_done)

is used to “join” execution by requiring that, at this point, the process will wait for the two read
actions to complete. By expressing fork and join semantics, RTC allows a powerful method for
expressing concurrency within a process. Note that global events and after clauses in timing blocks
can also be used to express precedence orderings between actions in different processes.
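
The "join" can be read as a comparison of times: the process may proceed only once the current time has reached the latest of the completion events. The following Python sketch illustrates that reading; the function names and the numeric times are hypothetical, and an outstanding action is modelled as an event time of infinity.

```python
import math

def earliest_start(*event_times):
    """'after max(e1, e2, ...)': start no earlier than the latest event."""
    return max(event_times)

def may_proceed(now, *event_times):
    """True once every event in the join has occurred by time `now`."""
    return now >= earliest_start(*event_times)

# Hypothetical completion times (seconds) of the two forked reads.
fidu_read_done = 3.2
op_read_done = math.inf      # operator read still outstanding

print(may_proceed(4.0, fidu_read_done, op_read_done))  # False: join must wait

op_read_done = 3.7           # completion is signalled
print(may_proceed(4.0, fidu_read_done, op_read_done))  # True
```

Taking the maximum works precisely because an unsignalled event has an infinite time: a single outstanding action is enough to hold the join.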

Exclusive blocks. The constructs begin exclusive - end exclusive denote a series of
statements where all action invocations in the series must be executed on their resources without
incompatible actions (as defined by the resource compatibility construct mentioned above) being
executed while the block is active. In the example, an exclusive block is used to ensure that no
incompatible action, such as a write to the FIDU, is allowed while the FIDU and operator input
are being read.

Guaranteed blocks. The constructs begin guaranteed - end guaranteed denote a series of
action invocations that must execute without delay due to contention for resources. In the example,
the set of actions to read the inputs, calculate the output, and output the results appear within a
guaranteed block. Immediately inside this guaranteed block there is a timing block with a latest
start time constraint of last_update + 8sec. In this part of the process, we make the assumption
that the actions take a maximum total of two seconds if there is no contention for resources. By
constraining the actions with a latest start time of two seconds before the deadline and placing them
in a guaranteed block so that there is no contention for resources, we know that the actions will only
start if they will complete by the deadline (barring faults). This technique allows detecting potential
deadline violations early so that they can be avoided. In the MATE process, if the actions are not
started early enough, the E_START exception handler can generate a quick, complete, acceptable,
but less accurate display.
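
The arithmetic behind the latest start time is simple enough to sketch. In the Python fragment below, the 10 second deadline offset and 2 second maximum execution time come from the MATE example; the function names and the `StartTimeViolation` class (standing in for E_START) are hypothetical.

```python
class StartTimeViolation(Exception):
    """Stands in for the E_START exception of the text."""

def latest_start(deadline, max_execution_time):
    """Latest time at which the block can start and still meet its deadline."""
    return deadline - max_execution_time

def try_start(now, last_update, deadline_offset=10.0, max_execution_time=2.0):
    """Return the start time if the guaranteed block may begin, else raise."""
    deadline = last_update + deadline_offset
    if now > latest_start(deadline, max_execution_time):
        raise StartTimeViolation("cannot finish by the deadline; use quick update")
    return now

# With last_update = 100 the deadline is 110 and the latest start is 108.
print(try_start(107.5, last_update=100.0))    # 107.5: the block may start
try:
    try_start(108.5, last_update=100.0)
except StartTimeViolation as exc:
    print("E_START:", exc)
```

The check is only sound when the maximum execution time is known and the block runs without contention, which is why the text pairs the latest start time with a guaranteed block.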

No_except blocks. The constructs begin no_except - end no_except block exceptions for
certain parts of the process. In the example, the call to Display.output and the updating of
the associated control information is done in a no_except block. This is done so that a deadline
exception will not violate the atomic property of the Display device, nor will such an exception
leave the control variable of the MATE process, last_update, unsignaled.
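
One way to picture a no_except block is as a gate that holds back exceptions until the protected statements have finished. The Python sketch below is an illustration under our own assumptions: `ExceptionGate` and `raise_or_defer` are hypothetical names, and the violation is delivered by an explicit call rather than asynchronously by a run-time system.

```python
from contextlib import contextmanager

class DeadlineViolation(Exception):
    """Stands in for E_DEADLINE."""

class ExceptionGate:
    """Holds back exceptions while a no_except block is active."""

    def __init__(self):
        self._masked = False
        self._pending = None

    def raise_or_defer(self, exc):
        # Called when a constraint violation is detected.
        if self._masked:
            self._pending = exc   # remember it; deliver after the block
        else:
            raise exc

    @contextmanager
    def no_except(self):
        self._masked = True
        try:
            yield
        finally:
            self._masked = False
        if self._pending is not None:
            pending, self._pending = self._pending, None
            raise pending         # deliver the deferred exception

gate = ExceptionGate()
log = []
try:
    with gate.no_except():
        log.append("display output started")
        gate.raise_or_defer(DeadlineViolation())  # deadline passes mid-update
        log.append("display output finished")     # still runs
except DeadlineViolation:
    log.append("deadline handler")
print(log)
```

The display update runs to completion and the deadline handler runs only afterwards, which is the ordering the text requires to keep the Display action atomic.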

2.3  Timed  Fault  Tolerance:  Support  for  Timed  Atomic  Behavior 

The techniques that we use to handle timing faults are based on atomic action paradigms that
have been widely used to support reliability in non-real-time concurrent systems [6, 7]. In these
traditional paradigms, changes to system resources and the environment are performed by actions
(sometimes called transactions) with the property that, even if faults occur, either 1) the action
completes and transforms the system to a consistent state, or 2) it appears that the action did not
execute and the system is left in an original consistent state. That is, atomic actions perform all or
nothing execution. Unfortunately, traditional atomic actions only require that all actions eventually
decide whether to execute or not [7]; there is no deadline by which the decision and action must
be completed. To incorporate the timing constraints of real-time systems into the consistency that
atomic execution seeks to preserve, we would like to modify the definition of atomic behavior to
allow all-or-nothing execution within timing constraints. Since meeting this criterion is provably
impossible when faults can occur [8], we alter our definition of timed atomic behavior further to be:
timed atomic behavior is all-or-nothing execution within timing constraints or an exception. This
definition is a generalization of the definition of timed atomic commitment presented in [9].

Specifying and Enforcing All-or-Nothing Behavior. To maintain consistency of the system
it is often necessary to specify that either all actions in a set execute completely or none of them
execute. To specify that all statements in a block complete, a no_except block can be used to
delay exceptions until after the statements complete. Specifying the “nothing” alternative involves
ensuring that no actions are executed if exceptions are possible during the no_except block. This
is done by nesting the no_except block inside a guaranteed block as the first statement of a timing
block. The timing block specifies a latest start time that is sufficiently far in advance of the deadline
to allow the statements to complete when there is no contention for resources. The guaranteed block
ensures that there is no contention for resources. If the statements are not started by this latest
start time, the E_START exception is raised and none of the statements start, thus achieving the
“nothing” alternative. Note that the programmer must know the maximum execution time of the
statements to be guaranteed in order to establish this latest start time. This technique is used to
ensure the atomic update of the display resource in the MATE example (see Figure 3).

While this expression of “atomicity” is somewhat unconventional, the fact that real-time control
applications directly affect the environment and are time-constrained makes traditional atomic
rollback [7, 6] impossible. For example, if an action moves a robot arm from a starting position,
a compensating action [10, 11] can bring it back to the starting position, but not erase the fact
that the move was performed or that the move took time. Thus, to achieve atomicity in a
real-time environment, we require that either all of the constrained statements complete once they are
started, or that none of them start.

Exception Handling. Since faults can occur, atomicity cannot be guaranteed; RTC blocks
therefore allow the expression of exception handling. We do not specify what the recovery actions
are, but instead provide a general exception handling mechanism so that programmers can express
various forms of recovery, including compensating actions [11, 10], imprecise computations [12], or
other forms of roll-back or roll-forward techniques. As described in Section 2, blocks can specify
exception handlers that interrupt execution for violation of latest start times, deadlines, maximum
execution times and simultaneous execution. Furthermore, action declarations in RTC resources
have an E_ABORT exception handler that becomes ready if the action invocation in the calling
process is aborted. This exception handler allows the action to restore the resource to a consistent
state, perhaps by employing a compensating action.

The execute timing block constraint can be used to express a form of early detection of a potential
deadline violation. If a fault causes a process or action to execute too long, it may eventually violate
its deadline. However, by handling the execution time constraint before the deadline is violated,
the deadline violation may be avoided. Consider the RTC code:

execute 5sec by NOW + 10sec do
    action r.a (params)
except
    when E_EXECUTE do early recovery actions end when
    when E_DEADLINE do recovery actions end when
end do

In this example, if a fault caused action r.a to execute for over 5 seconds, then the E_EXECUTE
exception handler may be able to avoid a deadline violation that otherwise would have resulted
from the fault.
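
The ordering of the two exceptions can be traced with a small simulation. The Python sketch below is a hypothetical model, not the RTC run-time system: it assumes the action runs without preemption, so elapsed time since the block was entered equals execution time, and it uses the 5 second budget and 10 second deadline of the example.

```python
def check_constraints(started_at, now, budget, deadline):
    """Return the name of the violated constraint at time `now`, or None."""
    if now > deadline:
        return "E_DEADLINE"
    if now - started_at > budget:
        return "E_EXECUTE"
    return None

# Block entered at t = 0 with 'execute 5sec by NOW + 10sec'.
start, budget, deadline = 0.0, 5.0, 10.0
for t in (3.0, 6.0, 11.0):
    print(t, check_constraints(start, t, budget, deadline))
```

At t = 6 the execution budget is exceeded while four seconds still remain before the deadline, and that interval is what the early recovery actions have available.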

For hard deadlines, which may be so critical that exception handling cannot restore a consistent
state if the deadline is missed, the RTC constructs can be used to specify an intermediate deadline
that is sufficiently far in advance of the actual deadline to allow for recovery. For instance, if a set
of statements S has a hard deadline D, and restoring a consistent state from partial execution of S
takes r maximum execution time without contention for resources, the following timing block can
improve the chances of being in a consistent state at D (let D' = D - r):

by D do
    by D' do
        statements S
    except /* D' violated */
        when E_DEADLINE do
            guarantee
                recovery (r time)
            end guarantee
        end when
    end do
except /* D violated */
    when E_DEADLINE do emergency actions end when
end do


Although  this  technique  is  pessimistic  by  raising  exceptions  when  a  violation  may  not  occur  in 
actuality,  it  improves  the  chances  of  being  in  a  consistent  state  at  D. 
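
The pessimism of the intermediate deadline is easy to see numerically. The Python sketch below uses hypothetical function names and example values; it only classifies what would happen to a run of S that, left alone, would finish at a given time.

```python
def intermediate_deadline(hard_deadline, recovery_time):
    """D' = D - r: the latest time at which recovery can still finish by D."""
    if recovery_time < 0:
        raise ValueError("recovery time must be non-negative")
    return hard_deadline - recovery_time

def outcome(finish_time, hard_deadline, recovery_time):
    """Classify a run of S that would otherwise finish at `finish_time`."""
    d_prime = intermediate_deadline(hard_deadline, recovery_time)
    if finish_time <= d_prime:
        return "completed"
    # S is interrupted at D'; recovery then runs for at most r and ends by D.
    return "interrupted"

D, r = 10.0, 2.0
print(intermediate_deadline(D, r))  # 8.0
print(outcome(7.5, D, r))           # completed
print(outcome(9.0, D, r))           # interrupted, although 9.0 <= D
```

The last case is the pessimistic one: S would have met the hard deadline, but it is interrupted at D' because, at that moment, there is no way to know it will finish in time and still leave room for recovery.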

3  The  POSIX.4  Standard  Interface 

Our approach to supporting the development of portable real-time software is to implement the RTC
run-time system on an architecture that adheres to the proposed IEEE POSIX 1003.4a standards
for  real-time  operating  systems.  We  now  present  an  overview  of  these  proposed  POSIX  real-time 
standards. 

Since  1985  the  IEEE  POSIX  (Portable  Operating  System  Interface)  group  has  been  developing 
an  operating  system  interface  standard  [13]  with  the  aim  of  standardizing  the  software  interface 
over  various  architectures.  They  have  also  formed  several  subgroups,  including  one  assigned  to 
work  on  real-time  extensions,  now  called  1003.4  [14],  and  one  to  work  on  a  thread-based  extension, 
called 1003.4a [15]². Although these extensions are not officially approved standards at the time
of this writing, government agencies and industry have already demonstrated strong support for
them. For instance, NASA has mandated that the software for the navigation system on its space
station  Freedom  be  POSIX.4  compliant. 

Some  of  the  features  of  particular  importance  in  the  POSIX.4  standard  are: 

•  Threads - A POSIX.4 process is a unit of allocation for memory and devices. POSIX.4
decomposes a process into threads, each of which is a flow of control with its own program
counter and run-time stack. In contrast to a Unix system where processes have a single
thread, POSIX.4 allows programmers to take advantage of inherent parallelism in
applications. Threads have minimal private state because they share state with all threads in the
process. Threads can be scheduled individually, or a process can be scheduled and then locally
determine which of its threads is to execute.

²Since this paper is concerned with real-time, we will use “POSIX.4” to mean that both of the POSIX 1003.4 and
POSIX 1003.4a standards apply.

•  Real-Time Scheduling Interface - POSIX.4 requires scheduling queues for a minimum of 32
priority levels, where threads from a higher priority queue are scheduled (preemptively) before
threads from lower priority queues. Each thread may specify its priority and whether it is to
be scheduled round robin or FIFO (round robin with infinite quantum).

• Signals - POSIX.4 requires at least 32 different signals. POSIX.4 signals are asynchronous
messages between threads that contain very little data. Threads may send a signal, may wait
on the arrival of a signal, and may have an asynchronous signal handler invoked upon arrival
of a signal. POSIX.4 also supports real-time signals that can be queued at the receiver.

• Timers - POSIX.4 requires system clocks and per-thread timers that can measure absolute
and  relative  times  on  the  order  of  nanoseconds.  Timers  can  be  “one-shot”  or  periodic,  and 
can  use  either  relative  or  absolute  time;  they  notify  threads  of  their  expiration  via  signals. 

• Shared Memory and Mapped Files - POSIX.4 allows the specification of main memory regions
that can be mapped into the address space of multiple processes, allowing multiple processes
to share memory. When one of the sharing processes writes to memory, all of the sharing
processes can read it, thus allowing efficient sharing of data without explicit communication.

• Message Passing - POSIX.4 requires a message passing facility that includes provisions for
queuing messages and several means of retrieving them. Message queues are system resources
that are allocated to processes. A process can establish various attributes, such as queue
length, for the queues that it owns. Asynchronous sends and receives of messages are supported.

• Binary Semaphore and Mutex - POSIX.4 requires binary semaphores as a means of synchronizing
threads of different processes. It also requires mutexes and condition variables as a
means of synchronizing threads within a single process. Mutexes, like semaphores, provide
mutual exclusion among threads. However, mutexes can take advantage of the fact that they
are local to a single process and can be implemented in the shared memory of that process.
Mutexes and condition variables can be implemented with priority inheritance protocols [16];
this form of implementation is optional, but has been shown to be beneficial for preserving
predictability in real-time systems [4].
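
To make the scheduling interface in the list above concrete, the following is a minimal simulation sketch in Python rather than actual POSIX.4 code. The names Thread and run, the quantum parameter, and the convention that a larger number means a higher priority are assumptions made for illustration. All threads are assumed ready at time zero, so preemption by a newly arriving thread is not modelled.

```python
from collections import deque

NUM_LEVELS = 32  # minimum number of priority levels named in the text


class Thread:
    """Hypothetical thread record used only by this simulation."""

    def __init__(self, name, priority, work, policy="FIFO"):
        self.name = name
        self.priority = priority  # assumption: larger number = higher priority
        self.work = work          # remaining time units of work
        self.policy = policy      # "FIFO" or "RR" (round robin)


def run(threads, quantum=1):
    """Return the name of the thread that runs in each time unit."""
    queues = [deque() for _ in range(NUM_LEVELS)]
    for t in threads:
        queues[t.priority].append(t)

    trace = []
    while True:
        # Serve the highest-priority non-empty queue first.
        level = next((q for q in reversed(queues) if q), None)
        if level is None:
            return trace
        t = level.popleft()
        # FIFO is treated as round robin with an infinite quantum.
        time_slice = t.work if t.policy == "FIFO" else min(quantum, t.work)
        trace.extend([t.name] * time_slice)
        t.work -= time_slice
        if t.work > 0:
            level.append(t)  # back of its own queue


demo = run([
    Thread("A", 5, 2, "RR"),
    Thread("B", 5, 2, "RR"),
    Thread("C", 10, 1, "FIFO"),
    Thread("D", 1, 3, "FIFO"),
])
```

In this sketch the two round-robin threads at the same level alternate, while a FIFO thread keeps the processor until it finishes or a higher level becomes ready.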

These  facilities  are  described  in  terms  of  a  common  interface  that  must  be  provided  to  applications. 
An  exact  implementation  is  not  specified  and,  in  fact,  can  vary  as  long  as  the  required  interface 
is  provided.  For  instance,  nothing  precludes  the  architecture  underlying  the  POSIX.4  interface 
from  being  a  real-time  architecture,  such  as  SIFT  [1]  or  MAFT  [2],  that  includes  traditional  fault 
tolerance techniques such as redundancy and voting.

The POSIX.4 standard developers did not seek to describe every capability that should be
provided by a real-time operating system, but instead provided a platform for building desired
features. Therefore, it is not surprising that the capabilities described by the standard alone
have deficiencies for supporting fault tolerant concurrent real-time systems. In particular, the
standard does not require integrated support of real-time concurrency and reliability constraints
such as: absolute timing constraints, exclusive execution constraints, and timed atomic behavior.
For instance, POSIX.4 supports priority-driven preemptive scheduling that can be used to meet the
timing constraints of processes, but the arbitrary preemption ignores the consistency constraints
of resources. On the other hand, POSIX.4 supports mutual exclusion techniques to maintain
the consistency of shared resources, but these techniques disallow potential concurrency and ignore
timing constraints. Furthermore, the POSIX.4 standards also do not directly require fault tolerance
support.

Although the POSIX.4 standard has these important deficiencies, we will show next that the
capabilities mandated by the standard are sufficient for implementing the RTC run-time system
which can support the development of real-time software that meets timing, concurrency, and
reliability requirements. Since it appears that the POSIX.4 standard will soon be widely used,
building such a software system using the POSIX.4 standard should also enhance portability and
re-usability of RTC based software that is developed.

4  Implementing  RTC  on  POSIX.4  Compliant  Systems 

The constraints expressed by the RTC language constructs are enforced by a run-time system built
on the commercial Lynx [16] operating system, which is POSIX.4 compliant. The RTC run-time
system consists of a set of managers each of which is implemented using a POSIX.4 process and one
or more POSIX.4 threads. Each RTC process is managed by its own process manager (QM); each
RTC resource has a single resource manager (RM); and each processor has a processor manager
(PM) which is used to reserve processors for guaranteed executions of actions. There is also a
centralized event manager (EM) which interacts with process and resource managers to implement
RTC events.

Scheduling. The Lynx operating system provides for scheduling queues at 256 priority levels
which contain threads from all Lynx processes. Since reliable systems should be dynamic to allow
for faults and other unforeseen occurrences in the environment, we wish to use a dynamic scheduling
algorithm in the run-time system. Preemptive Earliest-Deadline-First (EDF) scheduling
algorithms have been shown to be optimal for meeting timing constraints in dynamic systems [17].
Simulating earliest deadline first scheduling with a priority-based system is possible [18], but requires
infinitely many priorities (a priority for every possible deadline). We are currently investigating the
assignment of threads to priority queues that best supports EDF scheduling. This assignment
seeks to minimize the maximum length that any queue will reach because the maximum queue
length is used to quantify the loss of performance, compared to the optimal EDF scheduling, that
our RTC/Lynx system incurs. In addition to an original queue assignment, each thread must increase
priority as its deadline nears. The exact details of this scheduling strategy are still being
investigated.
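
Since the assignment strategy is still under investigation, the following Python sketch shows only one possible way to approximate EDF with a fixed number of priority levels. The linear bucketing over a fixed horizon, and the names priority_for, max_queue_length, and horizon, are assumptions introduced for illustration and are not the strategy used in the RTC/Lynx system.

```python
from collections import Counter

NUM_LEVELS = 256  # number of priority levels the text attributes to Lynx


def priority_for(deadline, now, horizon):
    """Map the time remaining until `deadline` onto a priority level.

    Level NUM_LEVELS - 1 is the most urgent. Linear bucketing over a
    fixed, positive `horizon` is an illustrative assumption.
    """
    remaining = max(0.0, deadline - now)
    fraction = min(remaining / horizon, 1.0)  # 0.0 means due now
    return (NUM_LEVELS - 1) - int(fraction * (NUM_LEVELS - 1))


def max_queue_length(deadlines, now, horizon):
    """Largest number of threads that share one priority queue."""
    counts = Counter(priority_for(d, now, horizon) for d in deadlines)
    return max(counts.values(), default=0)
```

Re-evaluating priority_for as time advances raises a thread's level as its deadline nears. Threads whose deadlines fall in the same bucket share a queue and are no longer ordered by deadline, which is why the maximum queue length serves as a measure of the departure from true EDF.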

Run-Time Support for Timing Blocks. To enforce the after construct of a timing block in
RTC process q, the process manager task for q, QMq, suspends q and uses the POSIX.4 timer and
signal capabilities to request a signal at the earliest start time. When the signal arrives, process
q re-activates. To enforce the before construct of a timing block in process q, QMq requests that
a POSIX.4 timer send a signal at the latest start time. An exception handling thread waits
for the signal. If the signal arrives before process q executes the statements of the timing block,
the exception handler thread is activated and it deactivates the thread for process q. Otherwise, if
process q starts the statements of the timing block, process manager QMq deactivates the exception
handling thread and removes the timer signal request. The by deadline construct of a timing block
in process q is implemented using a stack to keep track of the current deadline. As nested timing
blocks are entered by process q, QMq pushes the tighter deadlines on the stack; as the timing
blocks complete, QMq pops the deadline from the stack. Process manager QMq uses the deadline
on the top of the stack to determine the scheduling priority of the process, and to set a timer signal
to indicate deadline violations. Again, a separate thread is used to wait for the deadline signal and
then perform exception handling.
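
The deadline stack described above can be sketched in a few lines of Python. The class name DeadlineStack and its methods are hypothetical, and clamping a looser nested deadline to the enclosing one is our assumption about how "tighter" deadlines are kept on top; timers and signals are left out.

```python
class DeadlineStack:
    """Tracks the effective deadline of nested timing blocks."""

    def __init__(self):
        self._stack = []

    def enter_block(self, deadline):
        # Assumption: a nested block can only tighten the effective
        # deadline, so a looser one is clamped to the enclosing deadline.
        if self._stack:
            deadline = min(deadline, self._stack[-1])
        self._stack.append(deadline)
        return deadline

    def exit_block(self):
        # Leaving a block restores the enclosing block's deadline.
        self._stack.pop()
        return self.current()

    def current(self):
        """Deadline on top of the stack, or None outside any block."""
        return self._stack[-1] if self._stack else None

    def violated(self, now):
        """True if the effective deadline has already passed."""
        top = self.current()
        return top is not None and now > top
```

A process manager in this sketch would consult current() both to choose a scheduling priority and to arm the timer that reports a deadline violation.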

Resource and Processor Management. To ensure correct execution of an action, we need
the ability to block incompatible action invocations. POSIX.4 provides semaphores and mutexes
which block incompatible actions, as well as all concurrent actions, by using mutual exclusion.
However, these mutual exclusion techniques disallow many concurrent accesses of a resource that
would not violate consistency and as such they reduce utilization that could be valuable in meeting
timing constraints. Instead of mutual exclusion, the RTC resource managers allow more potential
concurrency through semantic concurrency control [19]. This concurrency control mechanism uses
action locks at both the resource and processor level. At the resource level, if an action is locked,
then compatible actions may execute, but no actions that are incompatible with the locked action
may be executed until the lock is released. A process manager QMq requests action locks from a
resource manager RMr by sending a POSIX.4 message specifying the set of actions that process q
wishes to lock on resource r, {a_1, ..., a_n}. RMr grants the request only if each action in {a_1, ..., a_n}
is compatible with resource r's currently held action locks and pending requests of higher priority.
If RMr does not grant a resource lock request, it queues the request based on process q's priority.
Resource managers use priority inheritance [4] by setting the priority of all currently executing
threads for actions that are incompatible with the requested action lock to at least the priority of
the requesting process.
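
The grant rule for action locks can be illustrated with a small Python sketch. The class ResourceManager, its compatibility set, and the convention that a larger number means a higher priority are assumptions for illustration. Message passing, priority inheritance, and the re-examination of queued requests after a release are omitted.

```python
import heapq
from itertools import count


class ResourceManager:
    """Toy action-lock manager for a single resource."""

    def __init__(self, compatible):
        # `compatible` holds frozensets of actions that may overlap,
        # e.g. frozenset({"read"}) means read is compatible with read.
        self._compatible = compatible
        self._held = []      # actions currently locked
        self._pending = []   # heap of (-priority, sequence, actions)
        self._seq = count()

    def _ok(self, a, b):
        return frozenset((a, b)) in self._compatible

    def request(self, actions, priority):
        """Grant locks on `actions`, or queue the request by priority."""
        blockers = list(self._held)
        for neg_prio, _, pending_actions in self._pending:
            if -neg_prio > priority:  # only higher-priority requests block
                blockers.extend(pending_actions)
        if all(self._ok(a, b) for a in actions for b in blockers):
            self._held.extend(actions)
            return True
        entry = (-priority, next(self._seq), tuple(actions))
        heapq.heappush(self._pending, entry)
        return False

    def release(self, actions):
        for a in actions:
            self._held.remove(a)

    def held(self):
        return list(self._held)
```

With read declared compatible only with itself, two readers are granted together, a writer is queued behind them, and a later low-priority reader is also queued because it conflicts with the pending higher-priority writer.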

Process manager QMq can also request action invocations from an RM by sending the request
in a POSIX.4 message. When RMr gets an action invocation request a from QMq that has not
been locked, it must first lock the action. Once a lock is held for the action, RMr creates a POSIX.4
thread t for action a and grants t access to the data of resource r through the POSIX.4 shared
memory facilities. If process manager QMq holds a processor lock, RMr creates t with highest
priority; otherwise, RMr creates t with (requesting) process q's priority. If process q's action
invocation is synchronous, process manager QMq suspends q while waiting for return parameters
from t; if the action invocation is asynchronous, q is not suspended. When action invocation thread
t completes, t sends the action's return parameters to a thread of the process manager QMq in a
POSIX.4 message. This thread shares memory with process q so that it can accept the returned
parameters and update their values in process q's state.

Meeting  Constraints.  To  ensure  exclusive  execution,  a  process  manager  must  obtain  action 
locks  for  all  action  invocations  in  an  exclusive  block  before  it  invokes  any  action  invocations  in  the 
block.  The  action  locks  must  be  held  until  all  action  invocations  in  the  block  have  completed.  In 
this  way,  no  action  invocation  that  is  incompatible  with  any  action  invocation  in  the  exclusive  block 
will  overlap  the  execution  of  the  block.  To  ensure  guaranteed  execution,  a  process  manager  must 
obtain  action  locks  and  a  processor  lock  for  all  actions  invoked  in  the  guaranteed  or  simultaneous 
block  before  it  invokes  any  action  in  the  block.  All  locks  are  released  when  the  block  completes.  The 
action  locks  ensure  that  no  action  invocation  of  the  block  is  queued  by  an  RM.  The  processor  locks 
ensure that the action invocations execute on their assigned processors when the action invocations
are  ready  and  that  the  action  invocations  are  not  preempted. 
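
The all-locks-before-any-invocation rule for an exclusive block resembles a scoped acquisition, sketched here in Python. LockTable and exclusive are hypothetical names. The sketch treats any already-locked action as incompatible and raises an error instead of queuing the request, both of which are simplifications of the resource manager behavior described earlier.

```python
from contextlib import contextmanager


class LockTable:
    """Toy stand-in for the action locks of the resource managers."""

    def __init__(self):
        self.held = set()

    def try_lock_all(self, actions):
        # All-or-nothing: either every action is locked or none is.
        if self.held & set(actions):
            return False
        self.held |= set(actions)
        return True

    def unlock_all(self, actions):
        self.held -= set(actions)


@contextmanager
def exclusive(table, actions):
    """Hold locks on every action for the whole duration of the block."""
    if not table.try_lock_all(actions):
        raise RuntimeError("locks unavailable; block must not start")
    try:
        yield
    finally:
        # Locks are released only after all invocations have completed.
        table.unlock_all(actions)


table = LockTable()
log = []
with exclusive(table, ["grip", "lift"]):
    log.append(set(table.held))
    try:
        with exclusive(table, ["lift"]):
            log.append("overlap")
    except RuntimeError:
        log.append("refused")
```

A guaranteed block would, under the same pattern, additionally acquire a processor lock before the body starts.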

5  Conclusion 

This paper has described how the RTC programming environment can be used to naturally express
real-time concurrency constraints in a C program so that the run-time system will enforce them on
a POSIX.4 compliant architecture. This explicit constraint expression simplifies the development of
the application software compared to the “expression” of constraints found in Ada. The RTC run-time
system provides concurrency control through a locking mechanism that ensures the consistency
of the shared resources. This mechanism potentially allows more concurrency with consistency than
POSIX.4's and Ada's mutual exclusion capabilities. Once the locking mechanism determines which
actions of a resource can execute, the run-time system employs the POSIX.4 preemptive priority-based
real-time scheduling for all threads (both actions and processes) in the system. Thus, the
system integrates concurrency control and real-time scheduling. Moreover, the implementation of
the RTC system using the IEEE POSIX.4 standard for architecture interfaces supports portability
and re-usability of software across architectures that will ease application development. To handle
timing faults, we also showed how the RTC constructs and run-time system can be used to enforce
the requirement (at the software level) that either all of the constrained statements execute within
timing constraints, or that none of them start. If faults occur, they are detected through exceptions
and as exceptions.

A  drawback  of  the  RTC  approach  is  the  overhead  incurred  due  to  the  managers  in  the  RTC 
run-time  system.  Timing  measurements  of  our  preliminary  implementation  can  be  found  in  [5]. 
However,  it  is  our  belief  that  predictable  performance  is  often  more  important  than  speed  in  a 
real-time system that must be reliable [20, 21].

We have used the RTC constructs to program distributed robotics applications that have real-time
concurrency constraints, such as two arms that must coordinate to lift a moving object within
timing constraints [5]. We are currently implementing the RTC/POSIX system and developing
submarine applications, including MATE.

Acknowledgments. We thank Susan Davidson and Insup Lee of the University of Pennsylvania
who were instrumental in the development of the RTC constructs.

ABSTRACT

This  paper  describes  the  ongoing  effort  in  the  Submarine  Launched 
Ballistic  Missile  (SLBM)  Software  Development  Division  (K50)  to 
establish  a  metrics  program  to  cover  the  entire  software  life  cycle 
process.  This  program  will  provide  assessment  information  on  both  the 
division's  software  process  and  its  products.  In  addition,  this  metrics 
program  will  provide  a  common  terminology  across  the  organization  and  a 
metrics  database  for  all  levels  of  the  organization  for  present  and 
future  endeavors.  This  paper  describes  the  objectives  of  this  program, 
the  approach  used  to  identify  a  set  of  candidate  metrics,  and  the 
selected set.


1.  Introduction 

To  understand  the  role  of  a  metrics  program  within  the  SLBM  Software 
Development Division (K50) it is necessary to understand its mission and the
types of software products it develops. The basic mission of this division
is:


"To  design,  develop,  produce,  and  maintain  the  shipboard  Fire  Control 
Programs,  Data,  and  Documentation  for  all  submarine  launched 
ballistic  missile  systems,  the  associated  support  software  (ex. 
compiler, linker, loader, etc.) and the software for effective targeting
of  the  weapon  system." 

Notice  from  this  mission  statement  that  this  division  deals  with  three  types 
of  software:  fire  control,  support,  and  targeting.  Each  poses  both  similar 
and  unique  problems  in  the  various  phases  of  the  life  cycle.  All  follow  the 
fundamental  waterfall  chart  for  life  cycle  development  and  all  follow  a  basic 
software  framework  that  was  established  within  the  division.  In  addition, 
all  three  types  have  an  established  change  control  process  as  part  of  an 
overall  configuration  master  plan.  Each  change  control  board,  however,  has 
slightly  different  software  change  control  forms  and  problem  reporting 
mechanisms.  Each  type  of  software  follows  a  different  standard  for  format 
and  each  type  is  developed  by  a  different  organization  within  the  division. 


In November of 1991, the division head, as part of an overall
strengthening of our software development process, directed that a formalized
metrics program be developed within the division that would cover all the
different types of software and life cycle phases. He felt that "metrics"
were being used in the software process and "kept" for historical purposes,
but  to  varying  degrees  for  the  different  types  of  software  and  life  cycle 
phases.  For  the  initiation  of  this  metrics  program,  four  basic  objectives 
were  established: 

1.  Identify  all  the  metrics  currently  collected  both  informally  and 
formally  within  the  division  for  all  the  software  types. 

2.  Determine  whether  the  metrics  currently  collected  are  being 
effectively  used  for  both  assessment  and  control  of  our  software 
process.  If  they  aren't,  develop  a  plan  to  accomplish  that. 

3.  Determine  what  additional  metrics  need  to  be  gathered  and  the 
implementation  mechanism  for  areas  in  the  process  that  need 
strengthening.

4.  Develop  a  common  terminology  for  metrics  within  the  division,  so 
that  for  a  given  metric  there  is  an  accepted  definition/usage  across  the 
organization  no  matter  what  the  type  of  software. 

From  these  basic  objectives  and  the  structure  of  the  division,  some 
additional  objectives  were  identified  upon  which  the  metrics  program  would  be 
built.  In  the  establishment  of  a  program  of  this  nature,  a  critical 
requirement  is  to  form  the  objectives  of  the  program  before  identifying  the 
metrics  to  support  it.  One  does  not  simply  collect  data  without  any 
objectives  in  mind!  These  additional  objectives  included: 

a.  Concentrate  on  quality  rather  than  productivity  type  metrics.  We 
were  concerned  with  both  the  quality  of  our  development  process  and  the 
resulting  products.  These  metrics  were  to  support  the  overall  software 
process  and  management  decision  making  at  all  levels. 

b. Use an iterative implementation approach. Implement and evaluate a
few metrics at a time rather than establish a large data collection
process  requiring  a  lot  of  resources  in  time  and  personnel.  We  wanted 
to  show  the  benefits  of  such  a  program  and  get  support  for  it  by  all 
levels.  The  best  way  to  do  this  was  to  demonstrate  the  benefit  with  a 
few  initial  metrics,  gain  acceptance  with  these,  and  then  implement  more 
in  an  iterative  fashion. 

c.  Involve  all  personnel  in  the  development,  implementation,  and 
analysis of the metrics program. The primary reason such programs fail
is a lack of commitment by the various levels of the organization. If
all  levels  are  involved  in  the  shaping  of  the  program,  you'll  take  a  big 
step  in  gaining  support  for  it. 


2. Approach

Our approach to establishing a metrics program was threefold:

1. Research work efforts within the division and study software
metrics, their applications and benefits to the overall software development
process.

2. Establish a Metrics Study Group (MSG).

3.  Implement  a  well-defined  metrics  selection  process. 

2.1  Research  Work  Efforts 

The  initial  groundwork  for  the  metrics  program  was  laid  by  the  authors. 
Particular  areas  to  be  researched  were  identified.  These  areas  included  work 
efforts  accomplished  within  each  branch  and  by  the  staff,  how  the  waterfall 
software  development  process  facilitates  these  work  efforts,  and  how  software 
metrics  could  be  applied  to  these  efforts. 

2.2  Establish  Metrics  Study  Group  (MSG) 

It  was  apparent  that  two  people  could  not  effectively  accomplish  all 
these  tasks  in  a  timely  manner.  The  division  head  asked  that  each  branch 
appoint  a  person  to  become  a  member  or  point  of  contact  (POC)  for  the  Metrics 
Study Group (MSG). Once the group members were appointed, the next step was
to  define  MSG  tasks,  as  well  as  POC  responsibilities. 

The  MSG  set  the  following  as  their  tasks: 

1.  Define  objectives  for  the  metrics  program 

2.  Research  metrics  concepts/applications 

3.  Identify  areas  of  the  software  development  process  to  which  metrics 
should  be  applied 

4.  Define  a  specific  set  of  metrics  to  be  collected  and  implemented  and 
develop  a  written  metrics  plan,  to  be  updated  as  needed 

POCs  were  chosen  such  that  all  three  principal  work  areas  were  well- 
represented.  There  were  individuals  representing  the  development  of  fire 
control,  support,  and  targeting  software.  Additional  areas  covered  were  data 
and  documentation. 

Also,  each  POC  chosen  had  an  understanding  of  how  their  organization 
operated  within  the  software  development  process.  They  were  not  new  employees 
just  establishing  themselves  within  the  organization. 

POC  responsibilities  were  defined  in  two  parts.  Initial  tasks  were 
assigned  to  get  the  metrics  program  in  place  and  then  additional  tasks  were 
defined  once  the  metrics  were  being  collected/analyzed. 

The  POCs'  initial  tasks  were  to  gain  an  understanding  of  software 
metrics  and  how  they  apply  to  their  branch's  work  efforts.  Several  training 
sessions  were  held  and  reading  material  was  given  out  and  discussed  at  group 
meetings. Next, each POC set up an interview session with the authors and
selected key personnel from the POC's branch. The purpose of this meeting was to
uncover any metrics that the branch might already be collecting and using and
to discover the areas in the development process that needed strengthening.
Since the average branch member attending these meetings had not had any
formal training in software metrics, a basic presentation on software metrics
was given by the authors to kick off the interview meeting. After the
presentation,  previously  formulated  discussion  items  were  given  to  the  meeting 
attendees  to  help  stimulate  discussion. 

The  MSG  then  analyzed  the  interview  results  from  all  branches  and 
pointed  out  commonalities  that  occurred  across  branches  within  the  division. 
Common  areas  were  used  as  a  starting  point  to  define  a  candidate  set  of 
metrics  which  would  be  collected/analyzed  as  a  part  of  the  metrics  program. 

Once  the  candidate  metrics  are  finalized,  the  POCs'  task  will  be  to  set 
up  data  collection  procedures  within  the  respective  branches.  They  will  serve 
as  focal  points  within  the  branches  and  the  division  on  the  data  being 
collected.  They  will  monitor  data  being  collected  and  report  to  the  MSG  the 
effectiveness  of  this  process.  Analyzing  data  collected,  reporting  progress 
to  management,  training  branch  members  (metrics  awareness)  about  software 
metrics  usage,  and  continuing  to  learn  about  software  metrics  are  all 
important  on-going  responsibilities  of  the  POC. 

2.3  Metrics  Selection  Process 

The  metrics  selected  to  be  a  part  of  the  program  were  chosen  based  on 
two  inputs.  First,  a  prioritized  list  of  areas  in  which  to  collect  metrics 
was  determined  through  the  interview  process.  Second,  a  specific  set  of 
selection  criteria  was  defined  by  the  MSG.  This  was  based  upon  impact,  what 
other  organizations  were  doing,  and  overall  quality  objectives  of  the 
division.

Branch  interview  sessions  brought  to  light  several  key  points  taken  into 
consideration  in  selecting  the  candidate  metrics.  It  was  clear  that  metrics 
were  presently  being  gathered  within  the  organization,  but  in  a  very  ad  hoc 
fashion.  Little  effort  was  being  made  to  formally  record  past  information 
which  could  be  utilized  to  aid  in  making  future  decisions.  Next,  some 
branches  used  metrics  more  than  others  due  to  the  nature  of  their  work 
efforts.  A  third  key  point  was  that  sizing  and  scheduling  were  common  themes 
in  most  work  areas.  Finally,  even  though  the  division  maintains  a  lot  of 
different  sets  of  computer  software,  few  maintenance  metrics  were  being 
gathered. 

Six  specific  areas  targeted  for  metrics  were  then  defined.  These  were: 

1.  Scheduling,  sizing,  development  manning 

2.  Better  utilization  of  Configuration  Management  board  information 

3.  Maintenance/reusability  metrics 

4.  Design  metrics  such  as  complexity,  size,  speed,  and  modularity 

5.  Requirements  development  and  stability 

6. Testing metrics: unit/modular, integrated, system, and quality
assurance 

In conjunction with this prioritized list, the MSG also read and
discussed other organizations' metrics programs. As each area of interest was
discussed,  each  of  the  following  organizations  was  researched  for  which 
metrics  they  collected  within  that  area: 


a.  Army,  Air  Force,  Navy 

b.  Other  major  organizations  at  NSWCDD 

c.  Professional  Societies:  IEEE,  AIAA 

d.  IBM* 

e.  HP* 

f.  MITRE 

g.  NASA 

*Please  note  that  IBM  and  HP  are  two  organizations  noted  by  the  Software 
Engineering  Institute  as  having  a  level  5  rating  for  the  Software  Process 
Assessment  Maturity  Level.  Having  a  rating  of  5  indicates  a  very  strong  and 
mature  metrics  program. 

Twelve  quality  factors  are  defined  by  [RADC,83]  to  improve  the  overall 
quality  of  software.  Five  of  these  quality  factors  were  chosen  as  especially 
applicable  to  the  work  efforts  within  the  division.  The  five  quality  factors 
chosen  were  Flexibility,  Maintainability,  Reusability,  Testability,  and 
Reliability.  The  metrics  selected  were  chosen  to  help  improve  the  software 
development  process  based  on  these  quality  factors.  These  particular  quality 
factors  were  chosen  due  to  the  maturity  of  the  division's  software  products. 


Another  important  concern  in  developing  a  metrics  program  is  the  impact 
that  it  will  have  on  the  organization.  A  successful  metrics  program  depends 
on  the  quality  of  data  being  collected.  Asking  personnel  to  change  their  way 
of  doing  their  everyday  job  must  be  approached  carefully. 


The  following  criteria  were  also  considered  when  the  candidate  metrics 
were  selected: 


1.  Data  availability 

2.  Implementation  time 

3.  Required  tools 

4.  Required  training 

5.  Necessary  changes  to  the  division's  software  development 
process,  including  any  changes  to  CM  board  policies  and 
procedures 


3.  Proposed  Metrics 


This  section  lists  the  candidate  metrics  that  have  been  proposed  based 
on  the  selection  criteria  and  the  work  that  the  MSG  did  as  discussed  in  the 
last  section.  Table  1  gives  a  listing  of  those  metrics,  a  brief  definition 
and  a  group  designation  for  the  order  in  which  the  metrics  will  be 
implemented.  Group  1  metrics  were  defined  to  be  implemented  within  the  first 
few  months  of  establishing  the  metrics  program.  Groups  2  and  3  metrics  will 
be  implemented  at  later  dates.  As  part  of  the  final  metrics  plan  that  is 
being  developed,  each  metric  will  have  the  following  sections  describing  it: 
Definition,  Benefit,  Implementation,  Quality  Factor,  and  Type.  The  Definition 
will  give  the  exact  method  for  calculating  the  statistic.  The  Benefit  section 
will  give  examples  of  how  the  metric  can  be  employed  and  the  resulting  value. 
The Implementation section will discuss how the metric is to be collected, how
often, who will collect it, who will analyze it, how it is to feed back into
the process, and what tools may be needed for implementation. The Quality
Factor section will relate what Quality Factor the metric is associated with,
while the Type section will relate what level the metric will be of benefit to
(management, branch, or the software process).

It  is  intended  that  the  list  will  be  dynamic  in  nature.  If  a  metric 
does  not  provide  the  benefit  or  information  that  is  expected,  other  ones  will 
be  considered.  If  there  are  areas  in  the  software  development  process  that 
need  additional  metrics  (it  is  anticipated  that  the  requirements  and  design 
phases  will  fall  in  this  category),  new  metrics  will  be  added  to  this  list.  A 
good  metrics  program  must  be  adaptable  to  a  changing  environment  in  order  to 
be  a  viable  part  of  the  process. 
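
As an illustration of how a metrics plan entry and a few of the ratio metrics in Table 1 might be represented, the following Python sketch uses the five sections named above as fields. The names MetricEntry, ksloc, defect_density, and change_density, and the placeholder field values, are assumptions for illustration and are not taken from the division's plan.

```python
from dataclasses import dataclass


@dataclass
class MetricEntry:
    """One entry of the metrics plan, with the sections named in the text."""
    name: str
    definition: str
    benefit: str
    implementation: str
    quality_factor: str
    metric_type: str  # management, branch, or the software process
    group: int        # implementation group from Table 1


def ksloc(executable_lines):
    """KSLOC: total number of executable lines divided by 1000."""
    return executable_lines / 1000


def defect_density(errors, executable_lines):
    """DEFECT DENSITY: number of software errors per KSLOC."""
    return errors / ksloc(executable_lines)


def change_density(scps, executable_lines):
    """CHANGE DENSITY: ratio of number of SCP's over KSLOC."""
    return scps / ksloc(executable_lines)


# Placeholder values only; the real sections come from the metrics plan.
entry = MetricEntry(
    name="DEFECT DENSITY",
    definition="Number of software errors per KSLOC.",
    benefit="(to be written)",
    implementation="(to be written)",
    quality_factor="(to be assigned)",
    metric_type="(to be assigned)",
    group=1,
)
```

For a hypothetical program of 12,500 executable lines with 25 recorded errors and 5 SCP's, these functions give a defect density of 2.0 and a change density of 0.4.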

Table  1.  Candidate  Metrics 


METRIC NAME                 METRIC DEFINITION                                   GROUP
--------------------------  --------------------------------------------------  -----
KSLOC                       Total number of executable lines divided by 1000.   1
KTLOC                       Total number of all lines divided by 1000.          1
SLOCMOD                     Total number of modules.                            1
NUMPEO                      Number of people assigned to a project per month.   1
PERMOD                      % of modules that have changed.                     1
# OF PR'S/PROGRAM           Number of problem reports (PR's) submitted          1
                            against a program.
DEFECT DENSITY              Number of software errors per KSLOC.                1
PR STATUS                   Number of 'OPEN', 'ANSWERED', and 'RESOLVED'        1
                            problem reports under configuration management.
REUSEMOD                    % of modules that have been carried over from       2
                            other software.
PMU                         Program memory utilization.                         2
MODCHG                      % of modules that have changed from one baseline    2
                            to the next.
DEFECT DENSITY/PHASE        Number of software errors per KSLOC per phase of    2
                            life cycle.
ERROR URGENCY               % of software problems requiring an immediate       2
                            fix over total number of software changes.
# OF PR'S RESULTING IN SCP  Ratio of number of PR's to Software Change          2
                            Proposals (SCP's).
ERROR CLASSIFICATION        Kind of software error that was made.               2
ERROR SEVERITY              Level of impact of the software error.              2
TESTCASE CLASSIFICATION     How the error was found.                            2
DEPENDENCE                  % of modules that call library, OS, or system       2
                            routines.
CHANGE DENSITY              Ratio of number of SCP's over KSLOC.                2
NUMDOC                      Number of documents associated with a given         2
                            program.
DATABASE DEPENDENCE         % of modules with database references.              2
FDIM                        % of errors introduced during the maintenance       2
                            phase.
PDIE                        % of errors resulting from enhancements to a        2
                            program.
KSLOC/NUMPEO                Number of thousands of lines of code                3
                            (developed, tested, etc.) per person per month
COUPLING                    % of modules that call other modules.               3
TTF                         Time to find and fix an error.                      3
TTI                         Time to identify the source of the error.           3
TTC                         Time to determine a fix for the error.              3
TTV                         Time to test the fix for the error.                 3
PMOCS                       % of modules violating coding standards.            3

SCP  REASON 

The  reason  why  a  software  change  proposal  is 

necessary- 

3 

KPOL 

Number  of  design  lines  for  a  program  in  the 
design  phase. 

3 

#  OF  DOCUMENTATION  PAGES 

Number  of  pages  within  a  specified  document 

3 

REQUIREMENTS  TRACEABILITY 

Ratio  of  number  of  requirements  at  completion 
to  number  started  with. 

3 

#  OF  REQ.  CHANGES/PHASE 

The  number  of  requirement  changes  to  the  life 
cycle. 

3 

This set has been selected based upon the objectives and structure of
the division and hence may not be applicable in its entirety for all programs.
An organization must tailor its program based upon its resources, needs, and
objectives.
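
As a rough illustration of how a few of the Table 1 definitions might be
computed from raw counts, the following Python sketch derives KSLOC, DEFECT
DENSITY, CHANGE DENSITY, and ERROR URGENCY. The function names, argument names,
and sample counts are hypothetical; they are not part of the division's metrics
plan, and the sketch does not guard against zero denominators.

```python
def ksloc(executable_lines: int) -> float:
    """KSLOC: total number of executable lines divided by 1000."""
    return executable_lines / 1000.0


def defect_density(software_errors: int, executable_lines: int) -> float:
    """DEFECT DENSITY: number of software errors per KSLOC."""
    return software_errors / ksloc(executable_lines)


def change_density(num_scps: int, executable_lines: int) -> float:
    """CHANGE DENSITY: ratio of the number of SCPs over KSLOC."""
    return num_scps / ksloc(executable_lines)


def error_urgency(immediate_fixes: int, total_changes: int) -> float:
    """ERROR URGENCY: % of problems needing an immediate fix over all changes."""
    return 100.0 * immediate_fixes / total_changes


# Hypothetical counts for one program (illustrative values only).
lines, errors, scps, urgent, changes = 25_000, 50, 10, 4, 40
print("KSLOC:", ksloc(lines))
print("DEFECT DENSITY:", defect_density(errors, lines))
print("CHANGE DENSITY:", change_density(scps, lines))
print("ERROR URGENCY (%):", error_urgency(urgent, changes))
```

Keeping each metric as a small function of raw counts mirrors the Definition
section of the metrics plan, which is meant to give the exact method for
calculating each statistic.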

4. Future Plans

The next step, after finalizing the metrics set, is to develop the
implementation  aspects  of  the  metrics  plan.  Specific  issues  include  who  will 
collect  the  data,  how  often,  who  will  analyze  the  data,  what  results  are  to  be 
reported,  and  what  data  collection  mechanism  is  to  be  employed.  Also  to  be 
considered  are  the  determination  of  the  effectiveness  of  such  a  metrics 
implementation  and  what  other  metrics  are  to  be  considered  for  incorporation 
into  the  plan.  Since  the  requirements  and  design  phases  were  deemed  important 
in  both  the  conducted  interviews  and  the  overall  objectives  of  the  division, 
more  metrics  in  these  areas  need  to  be  identified. 

This  identification  of  specific  areas  and  supporting  metrics  will  be  a 
continuing  process.  Once  the  division  becomes  accustomed  to  collecting  and 
utilizing  a  specific  set  of  metrics,  then,  as  part  of  the  iterative  process, 
an  additional  set  will  be  identified.  The  role  of  the  MSG  group  in  the  future 
will  be  to  help  in  this  identification  process  and  in  the  implementation  of 
the  supporting  metrics. 

5. Summary

In  this  paper  we  have  described  the  effort  undertaken  within  the  SLBM 
Software  Development  Division  to  strengthen  its  overall  development  process 
for  all  of  the  different  types  of  software  that  it  produces.  To  do  that  we 
established  certain  objectives  based  upon  management  and  division  inputs.  A 
committee  of  senior  personnel  from  each  branch  within  the  organization  was 
then set up to develop this metrics plan. Using inputs from the organization
and the established objectives, and after researching what other organizations
had done in this area, the committee identified a candidate set of metrics. The
selected metrics were grouped into three categories based upon impact, data
availability, and ease of implementation. Over the next year, data will be
collected and analyzed, and the results fed back into the process.

Key aspects in the development of this plan have been: 1) establishing
objectives, 2) getting participation from all levels of the organization,
3) using an iterative approach for implementation, and 4) maximizing the use
of data that is currently being collected. For this to be a viable program,
getting the support of all personnel within the division is essential. To
achieve this, ensuring their participation in developing the program and
training them in the effective use of metrics is paramount. By demonstrating
the effective use of metrics in an iterative fashion, actively seeking their
help and cooperation, and keeping the lines of communication open, we believe
we will have that support.

ABSTRACT

The  key  to  designing  a  real-time,  large, 
complex  system  is  to  optimize  the  design  to  meet  the 
requirements  and  desired  measure  of  effectiveness.  In 
order  to  achieve  this,  the  system  engineer/analyst  must 
have  the  capability  to  specify  the  design  goals/criteria, 
to  quantify  various  aspects  of  the  design,  and  to 
perform  trade-offs  among  different  design  goals.  One 
of  the  mechanisms  that  provides  these  capabilities  is 
the  System  Design  Factors.  Whether  the  system  design 
emphasis  is  on  real-time,  largeness,  complexity, 
parallelism, or any other specific criterion, it requires a set of
System Design Factors to describe the properties,
attributes, and characteristics of that system. Each
System Design Factor must have its own metric to
gauge every detail of that system. The metric describes
the weaknesses and strengths of a specific area in the
design. In turn, the correlation of the System Design
Factors characterizes the completeness and robustness of
the system. Whether the system is designed top-down,
bottom-up,  or  middle-out,  the  System  Design  Factors 
have  major  influence  in  design  capture  and  analysis, 
design  structuring  decisions,  allocation  decisions,  and 
trade-off  decisions  between  various  design  structures 
and  resource  allocation  candidates. 

The main objectives of the System Design
Factors research are to provide a) a mechanism to
communicate from the customer to the development
team throughout the various phases of system engineering,
b) a mechanism to quantify and identify a large,
complex, real-time system's strengths and weaknesses
so that effective comparison of different systems is
achievable, and c) a mechanism for linking various
aspects of the design, which helps the system engineer or
analyst to specify, capture, analyze, design, prototype,
test, evaluate, trade off, and implement the system
effectively. This paper presents a set of highly utilized
System Design Factors that system engineers or analysts
should consider early in the design to produce an
effective system [HNH91], [HNH92].


KEYWORDS: System Design Factors, Structure
Design Decision, Allocation Decision, Optimization
Decision, Trade-off Decision, Large Complex Real-Time
System.

1  INTRODUCTION 

Traditionally, building a system starts
with a customer who defines what is needed. These
needs are analyzed to determine the requirements
and specifications [EdH91]. In turn, these
requirements and specifications are captured to
produce the initial design [Hoa91]. Analysis is
executed to assure that the initial design is complete
and consistent [BiF90], [Hoa91]. This design is
optimized iteratively until a feasible or optimal design
is achieved [HNH91], [HNH92]. Collected results
are then passed through rapid prototyping,
assessment, evaluation, test, and refinement to yield
the final design [BoB85], [CYH91], [JeY91], [Kam91],
[SvL76]. Implementation and test are then carried
out to produce the final product, which is delivered to
the customer. Many times, the customer will
complain to the developer that the system did not
meet the needs. Common causes of failing to
meet the requirements include the following:
(a) the needs specified by the customer were not
specific enough; (b) the needs were never clearly
understood by the developers; or (c) communication
among developers distorted the requirements as the
development processes were performed.

The information understood by the whole
system development team is crucial to producing the
final product that meets the customer's needs. The
current system engineering methodology lacks this
communication mechanism from the customer to the
whole development team.

The first objective of System Design Factors
research is to provide one such communication
mechanism. In general, a system engineer or a
customer wants some form in which to specify what
criteria the end-result system must meet. The
desired criteria affect how the system would be
designed and developed. These criteria are in turn
the factors that the engineer must consider early in
order to avoid bad designs, reduce cost, and optimize
productivity [HHN90a], [HHN90b].

The second and third objectives are
illustrated by the following situation. Consider
two system engineers who are assigned
to build a system independently, given the same
requirements and specifications from the same
customer. When the two engineers deliver their
systems, suppose the customer asks to
compare quantitatively and qualitatively the different
properties of these two systems in terms of
performance, dependability, security, and real-time
responsiveness. How does this comparison proceed?
The second and third objectives of this research
address this question. These objectives provide the
mechanism for quantifying design goals of large,
complex, real-time systems. With the current state of
system engineering technology, there are no
normalized techniques to quantify and compare
systems. If a system's properties can be specified
quantitatively and qualitatively, then its strengths and
weaknesses can be identified and effective comparison
among different systems can be achieved. Being able
to qualitatively measure the system will not only
benefit the system engineers for evaluation purposes,
but it will also provide a benefit during the
requirements specification phase, capture phase,
analysis phase, design phase, optimization phase, and
trade-off phase.

The  proposed  solution  to  these  problems  is 
to  formulate  hierarchical  System  Design  Factors 
(SDF).  The  short  term  goal  is  to  collect  concepts  and 
ideas  from  government,  industry,  and  academic 
sources  to  formulate  a  complete  and  robust  system 
specification.  The  individual  factors  will  be  studied 
independently. The correlation of factors will be
investigated. Tests and applications will be made
to verify the correctness and consistency of the
formulation. The long term goals are to refine the
formulation, provide automation, and provide new
system engineering mechanisms and concepts that will
have significant impact on the next generation of
system engineering methodology.

The remainder of this paper is organized as
follows. Section 2: System Design Factors Taxonomy
provides a hierarchical view of SDF and the
current direction and focus of the research. Section
3: Example provides the touch and feel of SDF.
Section 4: Specification and Use of SDF describes the
utilization of the SDF template. Section 5: Current
Status provides progress information. Section 6:
Conclusion and Future Plans describes the ongoing
research pursuit.

2  SYSTEM  DESIGN  FACTORS  TAXONOMY 

The  current  thrust  of  this  research  is  to 
define and formulate the System Design Factors and
their relationships. These factors are categorized.
The formulation of these factors expresses the
relationship and behavior of closely and loosely
associated factors. The effect of the individual factors
on  the  design  or  engineering  process  is  being  studied. 
The  correlation  of  multiple  factors  is  also  undergoing 
study.  The  rating,  normalizing,  and  voting  techniques 
for  these  factors  are  being  derived.  The  research  is 
expected  to  generate  a  robust  SDF  taxonomy.  Each 
factor  will  consist  of  terminology,  definition,  source, 
metrics,  example,  usage,  and  notes. 

Currently there are eleven major groupings
of factors that seem to be required for most large,
complex, real-time systems. These groupings are
arbitrary. Each of the groupings consists of factors
that are closely associated with other factors, which
ultimately affect the factor's behavior by inheritance.
This hierarchical taxonomy will evolve as this research
effort progresses. The current SDF taxonomy is
shown below in Figure 1 (without detailed
descriptions, because of space limitations) to
demonstrate the SDF framework. This taxonomy
provides a set of System Design Factors that customers,
system engineers, or analysts should consider early in
the design in order to produce an effective system
[HNH91], [HNH92], [HHN90a], [HHN90b].

3  EXAMPLE 

This section gives some small examples
where SDF are used. Before this can begin, the
characteristics structure must first be introduced. In
order to effectively introduce the characteristics
structure, some definitions are provided to give a
common understanding.

1  PERFORMANCE
   1.1  RESPONSE TIME
   1.2  CAPABILITY
   1.3  RELATIVE ACTIVITY
   1.4  SPEED
   1.5  THROUGHPUT
   1.6  LATENCY
   1.7  LOAD BALANCING
        1.7.1  INFORMATION LOAD
        1.7.2  PROCESSING LOAD
   1.8  GRACEFUL DEGRADABILITY / LOAD SHEDDING
   1.9  EFFICIENCY
   1.10 PREDICTABILITY

2  REAL-TIME
   2.1  HARDNESS
   2.2  HARD DEADLINES
        2.2.0.1  PERIODIC
        2.2.0.2  APERIODIC
        2.2.0.3  SPORADIC
   2.3  SOFT DEADLINES
        2.3.0.1  PERIODIC
        2.3.0.2  APERIODIC
        2.3.0.3  SPORADIC
   2.4  TEMPORAL DISTANCE
   2.5  TARDINESS
   2.6  NUMBER OF CONSECUTIVELY MISSED DEADLINES
   2.7  PREDICTABILITIES
   2.8  GRACEFUL DEGRADATION

3  COMPUTATION/PROCESSING REQUIREMENTS
   3.1  IMPORTANCE
   3.2  USEFULNESS
   3.3  PRIORITY
   3.4  (COMPUTING) PORTABILITY
   3.5  INTERRUPT/RESET CAPABILITIES

4  DEPENDABILITY
   4.1  RELIABILITY
   4.2  ACCURACY
   4.3  FAULT TOLERANCE
   4.4  GRACEFUL DEGRADABILITY
   4.5  REDUNDANCY
        4.5.1  STATIC
        4.5.2  DYNAMIC
   4.6  AVAILABILITY
        4.6.1  INHERENT AVAILABILITY
        4.6.2  ACHIEVED AVAILABILITY
        4.6.3  OPERATIONAL AVAILABILITY
        4.6.4  EASE OF REPLACEMENT
        4.6.5  CRASH RECOVERABILITY
        4.6.6  COMPUTATION HEAVY PROCESS EFFECTS
   4.7  QUALITY

5  SECURITY
   5.1  CLASSIFICATION
   5.2  TYPE OF DATA
        5.2.1  LEVEL I (CLASSIFIED)
               5.2.1.1  TOP SECRET OR ABOVE
               5.2.1.2  SECRET
               5.2.1.3  CONFIDENTIAL
        5.2.2  LEVEL II (SENSITIVE)
               5.2.2.1  PRIVACY ACT/FINANCIAL
               5.2.2.2  FOR OFFICIAL USE ONLY
               5.2.2.3  SENSITIVE MANAGEMENT
               5.2.2.4  PROPRIETARY/PRIVILEGED
        5.2.3  LEVEL III (NONSENSITIVE)
               5.2.3.1  (OTHER - NOT CATEGORIZED IN LEVEL I AND II)
   5.3  PERCENTAGE OF PROCESSING TIME / SECURITY LOAD
   5.4  ENCRYPTION TYPE REQUIREMENTS
   5.5  IMPLEMENTATION TECHNIQUES REQUIREMENTS

6  HUMAN/MACHINE
   6.1  EASE OF USE
   6.2  POTENTIAL OPERATOR DECISIONS
   6.3  1. OPERATOR DELAY / 2. USER RESPONSE TIME
   6.4  OPERATOR ACTION(S)
   6.5  1. REQUIRED NUMBER OF OPERATORS / 2. NUMBER OF SIMULTANEOUS USERS
   6.6  USER INTENSITY
   6.7  AVERAGE TIME FOR EACH CATEGORY
   6.8  POTENTIAL ERRORS

7  PHYSICAL REQUIREMENTS
   7.1  SIZE REQUIREMENTS
        7.1.1  HEIGHT
        7.1.2  WIDTH
        7.1.3  LENGTH
        7.1.4  DEPTH
        7.1.5  AREA
        7.1.6  VOLUME
   7.2  WEIGHT REQUIREMENTS
   7.3  RUGGEDABILITY (RUGGEDIZED)
   7.4  SURVIVABILITY
   7.5  (PHYSICAL) PORTABILITY
   7.6  ENERGY REQUIREMENTS
        7.6.1  (ENERGY) CONSUMPTION
               7.6.1.1  ELECTRICAL (ENERGY CONSUMED)
               7.6.1.2  FUEL (ENERGY CONSUMED)
               7.6.1.3  OTHER (ENERGY CONSUMED)
        7.6.2  (ENERGY) DISSIPATED
   7.7  LOCATIONAL OPERATING ENVIRONMENT
        7.7.1  GEOGRAPHICAL LOCATION
        7.7.2  INDOORS/OUTDOORS
        7.7.3  TEMPERATURE
        7.7.4  HUMIDITY
        7.7.5  ACOUSTICAL NOISE
        7.7.6  AIR PURITY/QUALITY
        7.7.7  EXPOSURE TO WIND
        7.7.8  EXPOSURE TO WATER
        7.7.9  EXPOSURE TO ELECTROMAGNETIC RADIATION
        7.7.10 VIBRATIONS/STABILITY
   7.8  CLIMATE CONTROL
        7.8.1  COOLING
        7.8.2  HEATING
        7.8.3  HUMIDITY CONTROL
        7.8.4  ACOUSTICAL NOISE SUPPRESSION
        7.8.5  AIR PURITY/QUALITY CONTROL
        7.8.6  MOTION STABILIZATION
        7.8.7  LIGHTING
   7.9  MANUFACTURING CONSIDERATIONS
        7.9.1  PRODUCTION CAPACITY
        7.9.2  PRODUCTION TIME
   7.10 COMPUTER
        7.10.1  CPU
        7.10.2  MEMORY
        7.10.3  STORAGE

8  FINANCIAL REQUIREMENTS
   8.1  COST TO DEVELOP
   8.2  COST TO PROTOTYPE
   8.3  COST TO PRODUCE
   8.4  COST TO TEST
   8.5  COST TO PURCHASE
   8.6  COST TO OPERATE
   8.7  COST TO MAINTAIN
   8.8  COST TO REPAIR
   8.9  COST TO INCLUDE SECURITY CAPABILITY
   8.10 PRODUCTIVITY

9  TIME PROJECTED
   9.1  ESTIMATED TIME TO DEVELOP
   9.2  ESTIMATED TIME TO PROTOTYPE
   9.3  ESTIMATED TIME TO PRODUCE
   9.4  ESTIMATED TIME TO TEST
   9.5  ESTIMATED TIME TO PURCHASE
   9.6  ESTIMATED TIME TO OPERATE
   9.7  ESTIMATED TIME TO MAINTAIN
   9.8  ESTIMATED TIME TO REPAIR
   9.9  ESTIMATED TIME TO INCLUDE SECURITY CAPABILITY

10 LIFE CYCLE
   10.1  TESTABILITY
   10.2  MAINTENANCE
         10.2.1  EASE OF MAINTENANCE
         10.2.2  NUMBER OF PERSONS NEEDED TO MAINTAIN
         10.2.3  NOTIFICATION
         10.2.4  FREQUENCY
         10.2.5  MAINTENANCE DOWNTIME/DURATION
         10.2.6  DEGREE OF SYSTEM DISABILITY
         10.2.7  WHEN MAINTENANCE COMES DUE
         10.2.8  DURING MAINTENANCE
         10.2.9  WEAR LIFETIME
   10.3  OBSOLESCENCE LIFETIME
   10.4  REUSABILITY

11 FUTURE NEEDS CONSIDERATIONS
   11.1  ADAPTABILITY/FLEXIBILITY
   11.2  EXPANDABILITY
   11.3  COMPATIBILITY
   11.4  INTEGRABILITY
   11.5  INTEROPERABILITY
   11.6  INTEGRITY


FIGURE 1. TAXONOMY OF SYSTEM DESIGN FACTORS


Quantitative Value is a quantifiable measurement. It is a
numerical value. It represents the degree of excellence. Some
values may have a different type of range or minimum and maximum
cardinality. For example, temperature could be measured as 1205
degrees Kelvin and could vary only between 0 and 277.15 degrees.

Attribute is the quality of a person or thing (non-physical).

Property is the attribute which belongs to someone or something
(physical).

Characteristics is any special feature of a person or thing.

The hierarchical relationship among these
definitions forms a characteristics structure, which
provides a general mechanism for quantification.
This mechanism is applied with the System Design
Factors to quantify systems. An example is given to
demonstrate the relationship among these definitions.
The example in Figure 2 shows hierarchically that a
Subject has Properties, which have Attributes, which
in turn have a Quantitative Value and a Qualitative
Value (Characteristics). Consider an eagle, which has
the following properties: Performance, Life Cycle,
and Physical Requirements. Performance in
turn has the following attributes: Air Speed, Land
Speed, and Take Off Time. Life Cycle in turn
has the following attributes: Overall Sickness Time
(health) and Life Span. Physical Requirements
in turn has the following attributes: Size, Color, and
Wing Span. Size in turn has a
quantitative value (e.g., it could vary between 0.5 and 2.0
feet) and a qualitative value (characteristics) (e.g., it could
be small, medium, or large). The rest of the
quantitative and qualitative values are shown in
Figure 2.
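
The Subject / Property / Attribute / Value hierarchy described above can be
sketched as a small data structure. The following Python fragment is only an
illustration: the class names, the `characterize` method, and the rule that
splits a range into equal-width bands are assumptions made for this sketch,
not part of the SDF formulation. Only the eagle's Size range (0.5 to 2.0 feet)
and its labels (small, medium, large) come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Attribute:
    name: str
    value_range: Tuple[float, float]   # quantitative value: (minimum, maximum)
    characteristics: List[str]         # qualitative values, ordered low to high

    def characterize(self, value: float) -> str:
        """Map a quantitative value to a qualitative label.

        Assumption: the range is split into equal-width bands, one per label.
        """
        low, high = self.value_range
        if not low <= value <= high:
            raise ValueError(f"{value} is outside the range {low}..{high}")
        band = (high - low) / len(self.characteristics)
        index = min(int((value - low) / band), len(self.characteristics) - 1)
        return self.characteristics[index]


@dataclass
class Subject:
    name: str
    # Property name -> list of Attributes belonging to that Property.
    properties: Dict[str, List[Attribute]] = field(default_factory=dict)


size = Attribute("Size", (0.5, 2.0), ["small", "medium", "large"])
eagle = Subject("eagle", {"Physical Requirements": [size]})

for feet in (0.6, 1.2, 1.9):
    print(feet, "->", size.characterize(feet))
```

Because the Subject is just a name plus a dictionary of Properties, the same
structure could hold a system, subsystem, or component in place of the eagle.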


FIGURE 2. EXAMPLE OF CHARACTERISTICS STRUCTURE


In the above example, the Subject was an
eagle. However, the Subject can be replaced with
any of the following: system, subsystem, component,
module, task, node, device, or any object. This
characteristics structure provides a low level or
detailed link to the criteria, which in turn provide a
high level link to the System Design Factors. In other
words, the characteristics structure applied to the eagle
allows us to quantify and rate different aspects of its
species. A similar approach can be applied to a
system, thereby allowing us to quantify and rate different
factors of the system. The application of the
characteristics structure to the System Design Factors
is demonstrated in Figure 3.


As illustrated in this example (Figure 3), a
customer may need to rate, measure, or design the
system in terms of the following Properties:
Performance, Dependability [Joh85], [WaH91], and
Physical Requirements. Performance in turn
has the following Attributes: Response Time,
Throughput, and Latency. Dependability in turn
has the following Attributes: Reliability and Fault
Tolerance. Physical Requirements in turn has
the following Attributes: Weight, Size, and Power.
The Response Time could vary between 0.0 and 3.0,
and its characteristics could be fast, medium, or slow.
The rest of the quantitative and qualitative values are
shown in the graph. This mechanism allows one to
identify and effectively compare the strengths and
weaknesses of different systems.
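
One way such a comparison might proceed is to normalize each Attribute onto a
common scale and compare systems attribute by attribute. The sketch below is
hypothetical: the `normalize` function, the "lower is better" flag, and the
values for systems A and B are assumptions made for illustration. Only the
Response Time range (0.0 to 3.0) is taken from the example above; the 0.0 to
1.0 range for Reliability follows the template example in Section 4.

```python
def normalize(value, low, high, lower_is_better=False):
    """Scale a quantitative value onto 0.0..1.0, where 1.0 is best."""
    score = (value - low) / (high - low)
    return 1.0 - score if lower_is_better else score


# Attribute -> (range minimum, range maximum, lower_is_better)
ATTRIBUTES = {
    "Response Time": (0.0, 3.0, True),
    "Reliability": (0.0, 1.0, False),
}

# Hypothetical measurements for two independently built systems.
system_a = {"Response Time": 0.6, "Reliability": 0.95}
system_b = {"Response Time": 1.5, "Reliability": 0.99}


def rate(system):
    """Return the normalized score of every Attribute of a system."""
    return {name: normalize(system[name], *spec) for name, spec in ATTRIBUTES.items()}


scores_a, scores_b = rate(system_a), rate(system_b)
for name in ATTRIBUTES:
    stronger = "A" if scores_a[name] > scores_b[name] else "B"
    print(f"{name}: A={scores_a[name]:.2f} B={scores_b[name]:.2f} stronger={stronger}")
```

Listing the scores per Attribute, rather than collapsing them into a single
number, keeps each system's strengths and weaknesses visible.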

4  SPECIFICATION AND USE OF SDF

The example in Figure 3 shows the overall or
top level application of the SDF. The detailed
application of SDF is demonstrated through the
System Design Factors Template (Figure 4). The
purpose of this template is to provide a general
format to guide the system engineer or the customer
in the application of the System Design Factors. It
helps the engineer/customer specify what
goal/criteria they want to measure and allows the
template to be attached or probed onto a subsystem,
a component, an object, or the whole system itself,
just as in the previous examples. This provides the
metrification mechanism to quantify the various
aspects of design.


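 1. Name:               Reliability of Beam Former
 2. Type:               Probability
 3. Range:              0.0 to 1.0
 4. Units:              Units of Probability
 5. Methods/Principle:  Fault Tolerance, Highly Reliable Component
 6. Rationale:          Life Critical Function
 7. Relationship:       Availability, Fault Tolerance
    Relational Expression:  Positive Correlation, Negative Correlation
 8. Quantification:
    Type:               integer, float, double, short, or long
    Formula:            R(t) = 1 - F(t)    Calculated
                        0.989              Required
                        1.01 * 0.989       Budgeted
 9. Consistency Rule:
    a. By Aggregation
    b. By Type
    c. By Design Factor
    d. By View
    e. By Component
10. Reference:          Author's name
11. Definition:         Text Book
12. Annotation:         Comments
13. Next Template:

FIGURE 4. SYSTEM DESIGN FACTORS TEMPLATE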


The initial template was formulated, and an
example is given in Figure 4 to convey the touch and feel
of the template. Currently, there are thirteen items in
this template. The Name item is a slot holder for the
name of a specific design factor (e.g., Reliability of
Beam Former). The Type item is a slot holder for
the classification of the factor (e.g., Probability).
The Range item is a slot holder for the minimum and
maximum value or the cardinality of the factor (e.g.,
0.0 to 1.0). The Units item is a slot holder for the
unit of measurement of the factor (e.g., Units of
Probability). The Methods/Principle item is a slot
holder for the approaches or techniques that the
designer/customer considers to be associated with
this factor (e.g., Fault Tolerance, Highly Reliable
Component). The Rationale item is a slot holder for
the reason that this factor applies to a specific
component/object (e.g., Life Critical Function). The
Relationship item is the slot holder for the list of
closely associated factors (e.g., Availability, Fault
Tolerance). The Relational Expression field in this
item provides the slot for the list of correlations
between this factor and its closely associated factors
(e.g., Positive Correlation, Negative Correlation). The
Quantification item contains the Type and Formula
fields. The Type field in this item is the slot holder
for either integer, float, double, short, or long. The
Formula field in this item currently provides the slot
for three mathematical expressions: (1) the value
actually calculated (e.g., R(t) = 1 - F(t)), (2) the
value required (e.g., 0.989), and (3) the value
budgeted by the designer or customer (e.g., 1.01 * 0.989).
The Consistency Rule item consists of the
By-Aggregation, By-Type, By-Design Factors, By-View,
and By-Component rules. For example, the
By-Aggregation field provides a slot that holds the rule
governing this factor's consistency throughout the
hierarchy (e.g., Use Rule X and Rule Y). The
Reference item is a slot holder for the source of
reference or the name of the author who
formulated this factor. The Definition item is a slot
holder for a clarifying definition of this factor. The
Annotation item is the slot holder for comments, relevant
information, or warnings related to this
factor. Lastly, the Next Template item is not
completely defined at this time, but it is the slot
holder for any detailed specification that may not
require the customer's direction.
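
The thirteen items described above can be pictured as a record type. The
following Python sketch is an assumption-laden illustration rather than the
template's actual prototype: the class names, field names, the `in_range` and
`budgeted` helpers, and the sample value F(t) = 0.005 are all hypothetical.
The example slot contents (Reliability of Beam Former, 0.0 to 1.0, 0.989,
1.01 * 0.989, and so on) are the ones quoted in the text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Quantification:
    value_type: str = "float"                              # integer, float, double, short, or long
    formula: Optional[Callable[[float], float]] = None     # (1) actually calculated
    required: Optional[float] = None                       # (2) required value
    budget_factor: float = 1.0                             # (3) budget = factor * required

    def budgeted(self) -> Optional[float]:
        """Return the budgeted value, e.g. 1.01 * 0.989."""
        if self.required is None:
            return None
        return self.budget_factor * self.required


@dataclass
class DesignFactorTemplate:
    name: str
    factor_type: str
    value_range: Tuple[float, float]
    units: str
    methods: List[str] = field(default_factory=list)
    rationale: str = ""
    relationship: List[str] = field(default_factory=list)
    relational_expression: List[str] = field(default_factory=list)
    quantification: Quantification = field(default_factory=Quantification)
    consistency_rules: Dict[str, str] = field(default_factory=dict)
    reference: str = ""
    definition: str = ""
    annotation: str = ""
    next_template: Optional["DesignFactorTemplate"] = None

    def in_range(self, value: float) -> bool:
        """Check a value against the Range item."""
        low, high = self.value_range
        return low <= value <= high


reliability = DesignFactorTemplate(
    name="Reliability of Beam Former",
    factor_type="Probability",
    value_range=(0.0, 1.0),
    units="Units of Probability",
    methods=["Fault Tolerance", "Highly Reliable Component"],
    rationale="Life Critical Function",
    relationship=["Availability", "Fault Tolerance"],
    relational_expression=["Positive Correlation", "Negative Correlation"],
    quantification=Quantification(
        value_type="float",
        formula=lambda failure_probability: 1.0 - failure_probability,  # R(t) = 1 - F(t)
        required=0.989,
        budget_factor=1.01,
    ),
    consistency_rules={"By-Aggregation": "Use Rule X and Rule Y"},
)

calculated = reliability.quantification.formula(0.005)  # hypothetical F(t) = 0.005
print("in range:", reliability.in_range(calculated))
print("meets required value:", calculated >= reliability.quantification.required)
print("meets budgeted value:", calculated >= reliability.quantification.budgeted())
```

Keeping the calculated, required, and budgeted values side by side lets a
designer see whether a factor satisfies the requirement and whether it also
leaves the margin that was budgeted.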


FIGURE 5. DESIGN FACTORS DEPENDENCIES GRAPH


The advantage of this template is not just that it
eases the use of the factors; it also allows the
designer/customer to take the available factors and
customize their own design factors that are appropriate
for their specific needs. It is up to the
engineer/customer to decide which factors are important and
unimportant and to formulate the design goal
and design decision that the end-result system must
meet. The overall design goal and design decision of
the system can be described by the System Design
Factors dependencies graph shown in Figure 5.

The upper half of the graph is referred to as
the goal oriented design factors, while the lower half
is referred to as the decision oriented design
parameters. The goal oriented half is independent of the
implementation model, while the decision oriented half is
dependent on the implementation model. It would be
ideal for the design to be implementation
independent in the design phase; however, in practice this is
not always the case. SDF dependencies provide the
linkage between the implementation independent
(Design Goal) and implementation dependent
(Design Parameter) factors. The SDF dependencies graph
assists the engineer/customer in understanding the
behavior change of an individual factor. These
changes are based on its closely and loosely
associated factors. The behavior of each subsystem,
component, object, or the whole system with respect
to different factors (design goals) can be analyzed
separately or simultaneously.
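
The linkage between design parameters and design goals can be pictured as a
small signed graph. The sketch below is purely illustrative: the parameter
names, the goals they are linked to, and the +1/-1 correlation signs are
assumptions chosen for the example and are not taken from Figure 5.

```python
# Edges of a hypothetical SDF dependencies graph:
# design parameter (implementation dependent) -> {design goal: correlation sign}
DEPENDENCIES = {
    "number of processors": {"Performance": +1, "Cost": +1},
    "redundancy level": {"Dependability": +1, "Cost": +1, "Weight": +1},
}


def affected_goals(parameter, direction):
    """Return the expected direction of change (+1 or -1) of each linked goal
    when `parameter` moves in `direction` (+1 = increase, -1 = decrease)."""
    return {goal: sign * direction for goal, sign in DEPENDENCIES[parameter].items()}


def parameters_for(goal):
    """Return the design parameters that are linked to a given design goal."""
    return sorted(p for p, goals in DEPENDENCIES.items() if goal in goals)


print(affected_goals("redundancy level", +1))
print(parameters_for("Cost"))
```

Even a table this small shows how a single design decision can move several
goals at once, which is the behavior the dependencies graph is meant to expose.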


FIGURE 6. SINGLE CRITERIA OBJECTIVE FUNCTION


Although the scope of this paper does not
cover Design Structuring and Allocation Optimization
methodology, it is worth showing some applications of
SDF with such a method [HHN90a], [HHN90b],
[HNH91], [HNH92]. Assume that a customer
procured a contract to develop a system such that the
end-result system is required to meet certain
measurements in terms of Performance,
Dependability, Cost, and Security. As illustrated with
the previous template example, the engineer can
specify and attach these required factors to the
design. Based on the design goal and design
parameter, the engineer can tailor the single criteria
or multi-criteria objective function for optimization
[Naf91].

This design is then optimized based on the
tailored objective function. The first approach that
the engineer could take is to optimize the design with
a single criteria objective function (shown in Figure 6)
and then overlay the results (shown in Figure 7) to
execute trade-off analysis (Do«9t}. The second
approach is to optimize the design with a multi-criteria
objective function (shown in Figure 8). The first
approach optimizes the criteria one at a time, while
the latter approach optimizes these criteria
simultaneously.
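
The difference between the two approaches can be sketched over a handful of
candidate designs. Everything in the following Python fragment is an
assumption made for illustration: the three candidates, their normalized
scores (1.0 is best), and in particular the use of a weighted sum as the
multi-criteria objective function, which the text does not specify.

```python
# Hypothetical candidate designs; each criterion is already normalized so that
# 1.0 is best and 0.0 is worst (illustrative numbers only).
CANDIDATES = {
    "design A": {"Performance": 0.9, "Dependability": 0.6, "Cost": 0.4, "Security": 0.7},
    "design B": {"Performance": 0.8, "Dependability": 0.8, "Cost": 0.7, "Security": 0.6},
    "design C": {"Performance": 0.5, "Dependability": 0.9, "Cost": 0.9, "Security": 0.5},
}
CRITERIA = ["Performance", "Dependability", "Cost", "Security"]


def optimize_single(criterion):
    """First approach: pick the best candidate for one criterion at a time."""
    return max(CANDIDATES, key=lambda name: CANDIDATES[name][criterion])


def overlay():
    """Overlay the single-criterion results for trade-off analysis."""
    return {criterion: optimize_single(criterion) for criterion in CRITERIA}


def optimize_multi(weights):
    """Second approach: optimize all criteria simultaneously.

    Assumption: the multi-criteria objective is a weighted sum of the scores.
    """
    def objective(name):
        return sum(weights[c] * CANDIDATES[name][c] for c in CRITERIA)
    return max(CANDIDATES, key=objective)


print(overlay())
print(optimize_multi({c: 0.25 for c in CRITERIA}))
```

With these illustrative numbers the overlay never selects design B, yet B has
the highest weighted sum under equal weights, which is the kind of trade-off
the two approaches are meant to reveal.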


FIGURE 7. SINGLE CRITERIA OBJECTIVE FUNCTION OVERLAY


FIGURE 8. MULTI CRITERIA OBJECTIVE FUNCTION


The results of the single and multi-criteria
objective functions, together with the SDF
dependencies graph, provide the engineer with a
better understanding of the system under design. By
understanding the physical nature of or correlation
among the factors, the designer/customer can predict
the behavior and perform effective trade-offs. The
application of the SDF with optimization here merely
demonstrates some utility of the SDF. SDF can be
applied throughout the various phases of system
engineering. It is a critical component in system
engineering.

5  SUMMARY OF CURRENT STATUS

A list of System Design Factors was
generated and structured in a taxonomy format.
There are eleven main groupings of factors, with their
closely associated factors, defined so far. The
relationships among these factors are not well
understood at the present time, but we are attempting
to correlate these factors as this effort progresses. An
initial System Design Factors technical report has been
drafted. This draft provides a detailed description of
each design factor. The description consists of the
terminology, definition, source, metrics, example,
usage, and note. The terminology provides the
commonly used vocabulary word. The definition
provides the meaning of the factor. The source
provides the reference for the definition. The metrics
[JuA91] provide the unit of measurement (dimension)
of the factor. The example provides some illustration
of the factor. The usage describes when, where, how,
and why to apply the factor. Lastly, the
note provides any relevant information or warning
related to the factor. An initial SDF template and
example were demonstrated to get the feel of the
formulation. The prototyping of the SDF template
is underway. An initial System Design Factors focus
group has been established to collaborate and to
clarify issues in the SDF formulation.
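
The seven-part description of a design factor maps naturally onto a record type. The following is only a sketch: the class name and the sample entry are hypothetical and are not taken from the SDF report or its template.

```python
from dataclasses import dataclass, fields


@dataclass
class DesignFactor:
    """One entry in a design-factor catalog (hypothetical layout)."""
    terminology: str   # commonly used vocabulary word
    definition: str    # meaning of the factor
    source: str        # reference for the definition
    metrics: str       # unit of measurement (dimension)
    example: str       # illustration of the factor
    usage: str         # when, where, how, and why to apply it
    note: str = ""     # any relevant information or warning


# Made-up entry, for illustration only.
factor = DesignFactor(
    terminology="Response time",
    definition="Elapsed time between a stimulus and the system's response.",
    source="(hypothetical entry, not from the SDF report)",
    metrics="seconds",
    example="A track update displayed within 0.5 s of sensor input.",
    usage="Specify in the requirements phase; evaluate in the analysis phase.",
)

field_names = [f.name for f in fields(DesignFactor)]
```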

6  CONCLUSION  AND  FUTURE  PLAN 

The goal of this effort is to generate a list of
System Design Factors. These factors are intended to
be used throughout the whole system engineering
process. For instance, they are specified in the
requirements phase, encapsulated in the capturing
phase, quantified and evaluated in the analysis phase,
characterized in the optimization phase, and justified
in the design trade-off phase. These factors are critical
to the system engineering process.

Future plans include refining,
restructuring, and streamlining (if necessary) the
System Design Factors. More dedicated research
effort is being considered to focus on a smaller but
widely used set of design factors. From this smaller
set of design factors, intensive correlation will be
studied. The formulation will be incorporated into
the sonar example [Hoa91] and the Destination Level
I Prototype [HNH92], [HNH91] in other research
efforts for testing and refining.

The lessons learned in this effort will benefit
the whole systems engineering community. The list
is expected to evolve as this effort progresses. This is
a collaborative effort among Naval Surface Warfare
Center (NSWCDDWODET), DoD, other government
agencies, industry, and university communities.


Abstract

Maturing  software  CASE  tools  and  more 
defined  software  processes  now  provide  the 
foundation  for  changing  software  production 
from  an  art  into  an  engineering  discipline. 
An  essential  next  step  is  to  provide  objective 
means  to  assess  the  intrinsic  qualities  of 
software. In particular, intrinsic properties
like complexity have been shown to
dominate the development and maintenance
cost drivers. This paper reports on a method
to  objectively  assess  software  qualities 
related  to  complexity,  including  modularity, 
cohesion,  maintainability,  etc.  The  new 
methods  are  described  and  successful 
application  to  large  DoD  software  systems 
reviewed.  The  technique  should  be 
extendable  to  assess  specific  structural 
properties  of  software  related  to  security, 
safety,  and  reliability. 


Introduction

System  design  assessment  can  have  many 
goals,  of  which  the  most  common  are  risk 
assessment  or  requirements  verification. 
One  reason  assessment  activity  is  required 
is  that  software  as-built  is  almost  never 
compliant  with  the  program  intent.  One 
important  cause  of  non-compliance  in 
software  systems  is  that  software  is  usually 
more  complex  than  necessary  and  this 
unnecessary  complexity  increases  risk  and 
raises  the  development,  verification,  and 
validation  costs. 

The adverse effect of complexity on the
software life cycle cost is not just intuitive;
there is documented evidence of its effect. In
particular, it is known to be a dominant cost
driver for both the development and
maintenance phases, to increase the cost of
test and integration, to make the
achievement of properties like security very
difficult, and to make maintenance
expensive. For example, the COCOMO
costing model provides cost multipliers that
help determine the cost of a software system
[1]. Figure 1 shows the ratio of the maximum
to the minimum of each cost driver
multiplier to illustrate how sensitive the total
cost function is to a reduction in the value of
a cost driver. Complexity is the cost driver
with the most leverage. That is, a
proportionate reduction in the system
complexity affects the system cost more than
a proportionate reduction in any other cost
driver. Complexity reduction has even more
economic leverage than the use of modern
software practices or software development
tools!
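
The sensitivity ratio plotted in Figure 1 can be reproduced for a few drivers with a short sketch. The multiplier ranges below are an assumption: they are the values commonly tabulated for the 1981 Intermediate COCOMO model and should be checked against [1]. Only four of the cost drivers are shown.

```python
# Effort-multiplier extremes (lowest rating, highest rating) for a few
# cost drivers. ASSUMPTION: values as commonly tabulated for the 1981
# Intermediate COCOMO model; verify against reference [1] before use.
multiplier_range = {
    "CPLX (product complexity)": (0.70, 1.65),
    "RELY (required reliability)": (0.75, 1.40),
    "MODP (modern programming practices)": (1.24, 0.82),
    "TOOL (use of software tools)": (1.24, 0.83),
}


def sensitivity(extremes):
    """Ratio of the largest to the smallest multiplier of one cost driver."""
    first, last = extremes
    return max(first, last) / min(first, last)


# Drivers ordered from most to least leverage on total cost.
ranked = sorted(
    multiplier_range,
    key=lambda name: sensitivity(multiplier_range[name]),
    reverse=True,
)

for name in ranked:
    print(f"{name}: {sensitivity(multiplier_range[name]):.2f}")
```

Under these assumed values complexity has the largest ratio, ahead of both modern programming practices and tool use, which is the ordering the text describes.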

In  addition  to  cost,  complexity  has  adverse 
effects  on  many  system  properties  such  as 
security,  safety,  and  fault  tolerance.  These 
attributes  need  high  assurance,  and 
unnecessary  complexity  in  the  system 
makes  such  assurance  either  very  costly  or 
impossible to obtain. Further, the life cycle
cost of software is critically affected by our
ability to easily maintain and modify
software, and many studies have shown that
overly complex systems have an inordinate
maintenance cost [2,3].

Despite the advantages of reducing software
complexity, no system can be arbitrarily
simple: requirements impose a minimal
complexity upon any solution. Instead, we
intend to manage and control complexity
growth and, if possible, to reduce inessential
complexity in software systems.

To do this, we are developing an objective
software assessment technique to
characterize software complexity, focusing
on those complexity issues that most likely
affect life cycle cost by affecting software
modularity, reusability, maintainability,
etc. We have begun by assessing software
implementations for complexity properties,
but we believe that our technique can be
extended to assess the complexity of software
designs and maybe even the complexity of
software requirements.




Fig.  1:  Cost  Driver  Sensitivity 


By  itself,  this  goal  is  not  sufficient.  To 
ensure  that  the  assessment  technique  is 
useful,  we  have: 

•  Used  it  to  determine  the  complexity 
in  software  produced  by  good 
practice, 

•  Developed  techniques  that  reduce 
complexity  in  reasonable  instances 
and  proved  those  techniques  work, 

•  Assessed  the  benefits  of  the  entire 
process. 

Although  our  initial  results  reported  here  are 
preliminary,  we  believe  there  is  sufficient 
promise  that  we  have  committed  to  the 
construction  of  a  second  generation 
assessment  technology. 

Technical Approach

Our  approach  to  assessing  software 
complexity  is  to  apply  pattern  analysis 
technology  to  software  systems.  We  believe 
that  as  software  systems  are  made  more 
regular  (less  complex),  their  design 
integrity  improves  and  the  life  cycle  cost 
should  decrease.  Generally,  pattern 
analysis as a conceptual approach has the
advantage that it looks for high level
characteristics,  assessing  large  scale, 
system  level  attributes  that  more  traditional, 
detail-oriented  assessments  miss.  On  the 
other  hand,  pattern  analysis  as  an  approach 
is  often  confused  by  small  variations  and 
either  fails  to  recognize  good  patterns,  or 
identifies  good  patterns  with  slight 
variations  as  bad.  This  sensitivity  to  false 
identifications  is  well  known  and  is  one  of 
our  more  significant  challenges. 

The potential applicability of pattern
analysis for software complexity
assessment is relatively easy to describe,
even if the real utility is less obvious. For
example, it is commonly accepted that
well-engineered software is modular, hides
information within modules, and
minimizes the complexity of inter-module
couplings. These characteristics, accepted
since the 1960's, are the foundation of the
work of authors such as Myers and Parnas.
One difficulty with the concepts is that they
are relative and descriptive, rather than
absolute and prescriptive. Although they
suggest that modules should be more tightly
coupled to their internal elements than to
external elements, the definition of "more"
or the specification of how much external
coupling is acceptable is answerable only in
the context of specific software instances.
Nevertheless, a knowledgeable software
engineer can sometimes examine the
modularity properties of a program and
quickly assess whether or not it is acceptable,
even though determining specific
modification suggestions may be much more
time consuming. Making such an
assessment process more objective is one
goal of our pattern analysis technique.

Consider  a  program  as  a  complex  network  of 
relationships  such  as  data  references, 
control  references,  semantic  dependencies 
(e.g.,  types),  exception  propagations,  etc.  If 
the  program  was  constructed  with  definite 
methods,  we  should  expect  that  the  methods  or 
their  goals  are  evidenced  in  the  final 
product.  In  particular,  because  most 
software  engineering  methods  emphasize 
the  control  of  relationships  (e.g.,  interfaces, 
data  references),  an  analysis  of 
relationships  should  reveal  the  effect  of  the 
software  design  and  construction 
disciplines.  However,  the  analysis  of 
relationships  is  challenging  because 
appropriate  relationships  are  not  regular 
constructs  like  bricks.  If  they  were  regular, 
the  detection  of  inappropriate  patterns  would 
be  as  easy  as  detecting  the  irregularities  in  a 
brick wall. Instead, the net effect of the
many indirect couplings between parts of a
software system provides a much more
complex picture that is difficult to
understand.

The  complex  network  of  interconnections 
within  a  program  defines  the  graph  to  be 
analyzed.  We  will  use  the  analogy  of 
analyzing  a  business  organization  for 
comparison.  The  usual  first  step  is  to  focus 
on  the  small  details  of  the  organization,  the 
specific  work  rules,  time  card  practices, 
personnel  policy,  etc.  Although  such 
information  can  characterize  the  spirit  and 
general  tone  of  an  organization,  it  does 
relatively little to clarify its organizational
structure. By analogy, analyzing software
for density of comments, use of risky
constructs, etc. gives only a limited and
sometimes misleading picture of the overall
system.


At  the  next  level  of  analysis,  we  would 
identify  the  minor  organization  structures 
(e.g.,  work  groups  within  departments)  and 
their  relationship  to  each  other.  The  minor 
structures  often  do  not  respect  formal 
organizational  boundaries  but  represent 
teams  of  organizationally  dispersed 
workers  that  cooperate  extensively.  In  the 
same  way,  we  can  identify  program 
elements  (e.g.,  functions,  subprograms, 
tasks  in  Ada)  and  ask  how  they  cooperate 
within  the  program  to  achieve  specific  tasks. 

At  a  still  higher  level,  an  organization’s 
minor  elements  form  major  working 
groups.  These  larger  groups  characterize  the 
real  organizational  points  of  control  and 
critical  workgroups.  In  the  same  way,  the 
smaller  elements  of  a  program  form 
subsystems  or  abstractions  that  characterize 
the  larger  compositional  structure  of  the 
program. 

This  process  of  identifying  smaller 
structures  and  grouping  them  into  larger 
structures  in  a  recursive  manner  may  be 
repeated  many  times,  from  different  points 
of  view,  to  properly  characterize  a  software 
system  or  an  organization.  For  example,  if 
we  analyzed  the  organizational  flow  of 
forms,  or  of  capital  authorizations,  or  of 
proposal  writing,  or  of  the  coordination  of 
social  gatherings,  we  might  obtain  very 
different pictures. Similarly, we expect that
the analysis of data references, type
references, calling references, exception
processing, etc. will reveal different
structures in a software system.

Organizations,  like  software,  are  not 
perfect;  they  do  not  operate  exactly  as  they 
were  designed  to  do.  Therefore,  we  should 
expect  to  find  work  groups  of  all  sizes  that 
differ  significantly  from  the  formal 
organization  chart.  When  this  is  so,  it  is 
essential  to  determine  whether  or  not  the 
organization  (or  the  software  system)  should 
be  reorganized  to  better  match  the  work 
processes.  Conversely,  we  might  also  find 
that  work  relationships  contain  long  chains 
of  essentially  non-functional  relationships. 
These  are  obvious  candidates  for 
simplification  in  both  software  systems  and 
organizations.

In brief, our analysis approach examines the
patterns of relationships within a software
system to discover the naturally cohesive
elements within it. We do this by assuming
that a collection of modules can be usefully
described as a cohesive group if its members
more complexly relate to other group
members than they relate to the rest of the
system. From another perspective, a
cohesive group is perceived by a relatively
low complexity boundary between it and the
rest of the system. Our approach does not
criticize a grouping by labeling it an
inappropriate solution to the design problem,
but rather it identifies unusual complexity in
a solution. In this sense, our approach
criticizes not the system design itself, but
rather the workmanship of decomposing the
design into elements.
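
As a rough illustration of this definition, the sketch below compares the coupling inside a candidate group with the coupling that crosses its boundary. The graph, the reference counts, and the simple "internal exceeds boundary" test are assumptions chosen for clarity, not the measure used by the authors' tools.

```python
# Hypothetical coupling graph: (module, module) -> number of references.
coupling = {
    ("a", "b"): 5, ("b", "c"): 4, ("a", "c"): 3,  # densely related trio
    ("c", "d"): 1,                                 # single outward link
    ("d", "e"): 6,
}


def internal_and_boundary(group):
    """Sum the coupling inside `group` and the coupling crossing its edge."""
    group = set(group)
    internal = boundary = 0
    for (u, v), weight in coupling.items():
        members = (u in group) + (v in group)
        if members == 2:
            internal += weight
        elif members == 1:
            boundary += weight
    return internal, boundary


def is_cohesive(group):
    """A group is cohesive here if it relates more to itself than outward."""
    internal, boundary = internal_and_boundary(group)
    return internal > boundary
```

For example, `is_cohesive({"a", "b", "c"})` holds because the trio carries 12 internal references against 1 crossing its boundary, whereas `{"c", "d"}` fails the test.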

The  utility  of  this  approach  depends  upon  the 
efficacy  of  cluster  analysis  algorithms. 
Cluster  analysis  performance  depends  upon 
our  choice  of  measures,  metrics  for  coupling 
strength,  and  upon  our  computational 
approach.  A  summary  of  some 
considerations  for  algorithm  design  is  in 
Table 1. (More information on cluster
analysis can be found in reviews such as
[12].)

The  table  lists  properties  (such  as  a  measure) 
and  typical  examples  for  the  software 
analysis  problem,  e.g.,  data  references. 
Then  it  suggests  a  probable  strength  of  this 
measure  and  a  probable  weakness.  For  data 
references,  we  expect  that  they  should  be  well 
controlled  by  traditional  design  methods 
and  that  therefore  patterns  defined  by  data 
references  should  not  conflict  with  the 
design. A potential weakness of this
measure is that, if all methods do succeed in
controlling coupling by data references, then
the violations to be revealed will be few and
the insight gained slight. The coupling
strength criterion property describes the
method for defining coupling to be present.
The simplest approach is to define a metric
on a specified measure and then to say that
any pair whose coupling is stronger than a
specific metric value is considered coupled.
More adaptive approaches use fractions of
local or global averages of the metric to
define coupling. Finally, algorithms can be
divisive or agglomerative. Divisive methods
tend to be computationally expensive, but are
more able to discover imperfect clusters.
Agglomerative methods, on the other hand,
are usually inexpensive and identify small
structures more readily, but they can also be
easily misled by the presence of incidental
couplings or unusual structures.
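
The three coupling strength criteria can be contrasted on a small example. Everything below is a sketch under stated assumptions: the strengths are invented, and "local" is taken to mean the edges that touch either endpoint of a pair, which is only one plausible reading of the text.

```python
from statistics import mean

# Hypothetical metric values: number of references between module pairs.
strength = {
    ("a", "b"): 9, ("a", "c"): 8, ("b", "c"): 7,  # strongly coupled cluster
    ("c", "d"): 2,                                 # incidental link
    ("d", "e"): 3, ("e", "f"): 2,                  # weakly coupled cluster
}


def coupled_absolute(threshold):
    """Absolute criterion: coupled if stronger than a fixed metric value."""
    return {pair for pair, s in strength.items() if s > threshold}


def coupled_global_average(fraction):
    """Global criterion: coupled if above a fraction of the overall mean."""
    cutoff = fraction * mean(strength.values())
    return {pair for pair, s in strength.items() if s > cutoff}


def coupled_local_average(fraction):
    """Local criterion: coupled if above a fraction of the mean strength of
    the edges touching either endpoint (an assumed notion of 'local')."""
    result = set()
    for (u, v), s in strength.items():
        nearby = [t for (x, y), t in strength.items() if {x, y} & {u, v}]
        if s > fraction * mean(nearby):
            result.add((u, v))
    return result
```

With a threshold of 0 the absolute criterion couples every pair. `coupled_global_average(1.0)` keeps only the three strong pairs and loses the weak cluster entirely, while `coupled_local_average(1.0)` also keeps `("d", "e")` but drops `("e", "f")`, a small instance of the splintering weakness listed in Table 1.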

Our  initial  attempt  used  an  agglomerative 
method  using  data  and  call  references,  and 
an  absolute  coupling  criterion.  From  this 
experience,  and  using  Table  1  as  a  table  of 
expectations  (not  results),  we  chose  one 
agglomerative  method  and  one  divisive 
method  for  additional  evaluation  as 
suggested  in  Table  2.  As  complexity  is 
identified,  we  also  will  explore  methods  that 
help  reduce  or  control  complexity  as 
summarized  in  Table  2. 

Complexity  Identification 

The method of decoupled groups identifies
groups of program elements that are
decoupled by a specified criterion, such as
data coupling, using a bottom-up,
agglomerative clustering technique. We
began with an absolute measure of coupling,
e.g., any software module that references
data declared in another module is coupled to
the module that contains the data. This
approach is related to the ideas of Parnas and
Myers. Although the technique did, in fact,
provide useful insight, we also found that it
was too "ideal" an approach and often
provided unreasonable criticism of
reasonable software structures. Because
many programs appear to have clusters of
modules that relatively intensively
reference each other's visible data, only one
reference from one cluster to another ties
both clusters together. In this way, many
otherwise reasonably structured programs
appear to be totally coupled, whereas the
removal of only 6-8 isolated data references
would reveal a more reasonable
modularization of 8-10 sub-systems.
However, the approach is helpful for
security and safety analyses where rigid
criteria are necessary to provide high
assurance. Our second generation
technology will use all five measures in
Table 1 to more fully explore the potential of
this approach. We will explore the use of a
coupling strength criterion that uses the
average coupling of the group gathered so
far. From experience in other venues [9],
this method works well when there are clear
and easily discriminable clusters in the
relationships.

Property    Example                   Strength                            Weakness
----------  ------------------------  ----------------------------------  ----------------------------------
Measure     Data references           Often corresponds to method goals   Programs are often well designed
            Calls                     Reflects functional decompositions  Extensive coupling to primitives
            Types                     May reflect design abstractions     Unexplored; may be uncontrolled
            Withs                     Includes previous three measures    Insufficient for corrective action
            [illegible]               Reliability emphasis                Unexplored
Coupling    Absolute                  Cheap, easy                         Too sensitive
Strength    Local average             Adapts to sub-system structures     False alarms; splintering
Criterion   Global average            Adaptable                           Poor for inhomogeneous systems
Algorithm   Top-down, divisive        Can reveal high level structures    Often computationally costly
                                      more reliably
            Bottom-up, agglomerative  Often computationally inexpensive;  Very sensitive to measure choice
                                      more likely to reveal small
                                      reusable modules

Table 1: Properties of Cluster Analysis Algorithms
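
The tendency of the absolute criterion to tie clusters together can be seen in a minimal sketch. Under that criterion any data reference couples two modules, so bottom-up grouping reduces to finding connected components. The union-find implementation and the module names are illustrative assumptions, not the authors' tool.

```python
def decoupled_groups(modules, references):
    """Agglomerative grouping under the absolute criterion: any reference
    couples two modules, so the groups are the connected components."""
    parent = {m: m for m in modules}

    def find(m):
        # Follow parent links to the group representative (path halving).
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for user, owner in references:
        parent[find(user)] = find(owner)  # merge the two groups

    groups = {}
    for m in modules:
        groups.setdefault(find(m), set()).add(m)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical system: two clusters that reference only their own data.
modules = ["nav1", "nav2", "nav3", "io1", "io2"]
refs = [("nav1", "nav2"), ("nav2", "nav3"), ("nav3", "nav1"), ("io1", "io2")]
stray = ("nav3", "io1")  # one isolated cross-cluster data reference

clean = decoupled_groups(modules, refs)
tied = decoupled_groups(modules, refs + [stray])
```

Here `clean` contains two groups, while `tied` collapses into a single group of all five modules: one stray reference is enough to make the program appear totally coupled.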

A complementary approach identifies
sub-systems within the larger system that are
locally more coupled or cohesive within the
group and thereby less complexly related to
the rest of the system. Instead of looking for
sub-systems with little or no coupling to other
sub-systems, we look for graph partitions
with maximal internal cohesion or
integrity. This difference is an essential
point.

Goal            Method                    Strength                     Weakness
--------------  ------------------------  ---------------------------  -------------------------------
Complexity      Decoupled groups          Corresponds to method goals  Few programs are good enough
Identification  Maximally coupled groups  Works on any program         May not map easily to design or
                                                                       method goals; algorithms are
                                                                       difficult
Complexity      Isolated corrections      Cheap, easy, effective       May not provide essential
Reduction                                                              simplification
                Sets of changes           Supports cost/benefit        Algorithms are difficult
                                          analyses

Table 2: Pattern Analysis Methods for Managing Complexity

Previous researchers, including ourselves,
looked for sub-systems that were coupled to
the rest of the system by a specific strength or
less; that is, the focus was on the relative
pair-wise coupling strength. We now believe that
it is as important to use inverse logic: look
for sub-systems that are inherently more
coupled to themselves than to the rest of the
system. This is a much harder but, we
believe, a more promising technique. We
have designed a top-down, divisive method
that uses local average coupling strength
and any of the five measures of coupling
from Table 1. This method tries to find the
lowest complexity boundaries with respect to
the complexity of the groups formed thus far,
rather than with respect to a
context-insensitive definition of low complexity. It
will use variations of the Kernighan and
Lin [10] clustering methods.
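
One step of a top-down, divisive method is to split a group along its lowest complexity boundary. The sketch below finds that boundary by exhaustive search over equal-size splits of a tiny, invented graph. This is deliberately not the Kernighan and Lin heuristic, which approaches such a split by iteratively swapping nodes between the two sides; exhaustive search is substituted here only because it is short and easy to check, and it does not scale beyond a handful of modules.

```python
from itertools import combinations

# Hypothetical symmetric coupling weights between six modules.
weights = {
    ("a", "b"): 4, ("a", "c"): 3, ("b", "c"): 5,
    ("d", "e"): 4, ("d", "f"): 2, ("e", "f"): 3,
    ("c", "d"): 1, ("b", "e"): 1,
}
nodes = sorted({n for pair in weights for n in pair})


def cut_weight(side):
    """Total coupling that crosses the boundary between `side` and the rest."""
    return sum(w for (u, v), w in weights.items() if (u in side) != (v in side))


def best_bisection():
    """Try every equal-size split; feasible only for very small graphs."""
    half = len(nodes) // 2
    # Fixing the first node on one side avoids testing each split twice.
    candidates = (
        frozenset(c) for c in combinations(nodes, half) if nodes[0] in c
    )
    return min(candidates, key=cut_weight)


side = best_bisection()
```

On this graph the lowest boundary separates `{a, b, c}` from `{d, e, f}` with a cut weight of 2; repeating the step inside each half would continue the division.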

Complexity Reduction

The  other  two  methods  in  Table  2  are  the 
complexity  reducing  analyses  to  identify 
couplings  that  can  easily  be  modified  or 
removed  in  an  attempt  to  reduce  complexity. 
This  requires  specific  information  about 
Ada  language  structures,  and  one 
illustrative  result  will  be  described  later. 
However,  once  these  methods  remove  the 
clutter,  there  remains  some  essential 
complexity  that  may  take  much  more  effort  to 
remove.  We  have  some  proposed  algorithms 
to  suggest  sets  of  changes  and  guide  decision 
making.  These  algorithms  will  be 
implemented and tested in our second
generation analysis tool set.

Our approach is unique, but not novel (Table
3). It is related to methods described in the
literature for some time, but no one, to our
knowledge, has developed a broadly
applicable, high level analysis technique.
The traditional statement level, detailed
methods, such as Halstead and McCabe,
provide statistical characterizations, but
little help for modification or repair
strategies.

Our  approach  instead  builds  upon  the  module 
level  definition  of  complexity  such  as  that 
pioneered  by  Parnas  and  Myers.  It  is  our 
goal  to  use  this  sense  of  inter-module 
complexity  to  characterize  whole  systems  so 
we  can  identify  locally  complex  subsystems 
within  that  whole  that  should  be  examined 
more  closely.  Yau,  Belady,  and  Card 
provide  early  examples  of  applications  of 
this  concept  which  we  have  extended  and 
modified  to  be  practical  for  large  Ada 
systems. 

These analyses require extensive
automated analysis capabilities to collect the
information from large Ada programs, trace
the desired relationships, and identify the
complexity relationships of interest.
Figure 2 illustrates the architecture of our
second generation analysis technology
(known as ADAPT) now under
development.

We  ensure  that  the  ADAPT  analysis  system 
will  analyze  large  Ada  codes  by  using  a 
Diana-based  tool  from  the  STARS  program 
for  semantic  analysis  of  the  full  Ada 
language.  This  tool  extracts  the  relevant 
information  and  stores  it  in  a  relational 
database. Many simple questions can be
answered by relational queries, but the
structural, pattern related questions require
further analysis. We found that exception
raising and propagating relationships
require more complex infrastructure than
other relationships, but otherwise the
technology is similar.


Method                           Level   Approach               Strength          Weakness
-------------------------------  ------  ---------------------  ----------------  --------------------
Grumman Pattern Analysis         System  Multiple               Adaptive          Less mature
Design Stability (Yau [4])       System  Modifiability          Sub-system focus  Single attribute
Data Binding (Belady [5])        System  Maintainability
Design Complexity (Card [6])     System  Design Assessment
Information Hiding (Parnas [7])  Module  Complexity Control     Well understood   Local evaluations
Coupling (Myers [8])             Module
AdaMAT/Logiscope                 Detail  Multiple Statistical   Mgmt. Tool        Limited applications
                                         Characterizations
McCabe/Halstead                  Detail  Statistical            Code Sensitive    Context insensitive
                                         Characterization

Table 3: Comparison of Complexity Characterization Methods

When we analyze software, we produce not
only insights into the general complexity of
the software and information on what groups
are coupled to others, but the analysis also
provides many helpful listings as a
by-product, such as variables that can be
constants, unnecessary with statements, etc.
For example, the analysis of the exception
handling methods of one 20,000 line system
provided detailed tables of information, of
which Table 4 is an extract. The capability
for a full static trace of exception
propagations will be available in our new
technology, but even limited statistics like
this can be helpful. For example, this
program has 69 when others, which seems
rather high. Further, four of its others
handlers are null, as are 18 of the named
exception handlers. This information
suggests that the exception handling logic
[may] not be fully developed. The complete
analysis is about 100 pages
of information. Although, for research
purposes, we usually analyze a system to see
whatever can be seen, we recognize that a
[user] would have specific applications in
[mind] and would require a much more
[focused] analysis. Examples of such role
differences are illustrated in Table 5.

To address these different roles, we have
taken care that all grouping algorithms
retain the linkages from the low level details
that caused the groupings to the high level
groups that result. This permits us to support
the more detailed requirements of the
verifier or maintainer that need specific
information on cause and effect.

Attribute                                 Count  Raise Stmts  Specific Handlers
----------------------------------------  -----  -----------  -----------------
Pre-defined exceptions explicitly raised      2            4                 63
User-defined exceptions raised               29           80                145
User exceptions defined but not used          1

Table 4: Example of exception information


User Role               Possible Goal            Application
----------------------  -----------------------  -----------------------------
Program Office          Software Architecture    Risk Assessment
                        Assessment               Reuse Potential
Engineering Management  Design Evaluation        Complexity
                                                 Modularity
                                                 Maintainability
Verifier                Property Certification   Security
                                                 Safety
                                                 Fault Tolerance
Maintainer              Re-engineering Analysis  Cost/Benefit Evaluation
                                                 Identification of sub-systems

Table 5: Potential Applications


Results 

A  sample  of  the  Ada  software  analyses  we 
have  done  so  far  is  in  Table  6. 

In all cases, we found that the patterns that
were perceivable by the data references
criterion, bottom-up, decoupled grouping
method were generally congruent with the
software designer's intent. On the other
hand, we found many examples in reputedly
well constructed code where pervasive
inter-module coupling was apparently
inadvertently introduced during the
software construction process. Deciding
which couplings could be most profitably
removed was not easily accomplished with
our first generation tool.

Some code systems (e.g., CAIS) were
specifically designed with object-oriented
paradigms, while others were designed with
functional abstractions. Although we saw
significant differences in the
interconnection patterns, there were also
significant similarities. This suggests that
some pattern analysis techniques are
insensitive to the particular design and
development paradigm.

In five of these eight examples, we presented
our analysis results to representatives of the
original implementation team. These
conversations confirmed that the insights
provided by the techniques are useful to the
developers and suggested many ideas for
potentially useful patterns. Note that our
technology does not produce patterns or
structures, but instead reveals structures
created during software design and
development. In one example, the user was
interested in assessing the reuse potential of
a software system. Our pattern analysis
technique suggested many places where
extensive data coupling made reuse
difficult. It is significant that the user did
not say "Find all places where data coupling
is a problem," but rather the analysis
technique directed attention to this system
aspect.

In another example, several related code
systems were re-engineered to increase the
number of shared modules. We applied our
analysis technology to before and after
versions of the re-engineered software.
Although the number of common modules in
the re-engineered systems increased by
about 65%, our analysis also showed that the
number of inter-module references (with
statements) increased by 153%. This
suggested (for this particular experiment)
that the benefits of increased commonality
were offset by a significant increase in the
code system complexity. Although our
technique cannot decide whether this is a
favorable or unfavorable result, it can reveal
the unintended effects of the re-engineering
effort.


Another  frequent  observation  was 
unnecessary  complexity  in  many  large 
software  systems  that  were  developed  under 
alleged  rigorous  methods  and  2167A 
documentation  requirements.  For  example. 
Figure  3  is  a  graph  of  with  relationships  in  a 
45,000  line  program.  In  this  example,  many 
with  relationships  have  the  pattern 
illustrated  in  Figure  4.  Here,  A  withs  B  and 
C,  and  B  also  withs  C.  In  some  cases,  A’s 
direct  reference  to  C  is  a  violation  of  the 
abstraction  provided  by  B,  and  in  other  cases 
it is not. However, Ada's rename statement
permits an easy graph simplification that
removes the direct reference from A to C, and
instead presents the C interface within the
specification  of  B.  This  simple  change 
transforms  Figure  3  into  Figure  5.  This  does 
not  imply  that  it  is  always  wise  to  accept  all 
these  transform  suggestions,  but  the 
magnitude  of  the  potential  simplification 
warrants  serious  attention  to  the 
suggestions. 
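
The pattern in Figure 4 can be searched for mechanically once the with relationships are available as a graph. The following is a minimal sketch in Python, not the analysis tool described here; the function name `redundant_withs` and the three-package example graph are hypothetical and serve only to illustrate the idea.

```python
from typing import Dict, List, Set, Tuple


def redundant_withs(withs: Dict[str, Set[str]]) -> List[Tuple[str, str, str]]:
    """Return (A, B, C) triples where A withs B and C, and B also withs C."""
    found = []
    for a, deps in withs.items():
        for b in deps:
            # Packages that both A and B reference directly.
            for c in deps & withs.get(b, set()):
                found.append((a, b, c))
    return sorted(found)


# Hypothetical package graph, not taken from the analyzed programs.
example = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
triangles = redundant_withs(example)
print(triangles)
```

Each reported triple is only a candidate for simplification; whether A's direct reference to C should be rerouted through B remains a design judgement.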


Name        Type                    Lines of Code

GIFRS       Simulation                     19,100
TASKIT      Simulation                     50,700
SGML        Text Processing                 5,400
ADAPT       Program Analysis               45,100
Navy Code   Message Handling               31,700
CAIS        File System Interface         359,800
DIANA       Language Translation           38,200
SPC Code    Compilers                      63,500

Table 6: Analysis Examples


As an unanticipated by-product of our
analysis tools, we found several such
simplifying transforms that are very helpful
in removing some of the more careless
complexity-raising dependencies. These


include  variables  that  can  be  constants, 
variables  that  should  be  moved  to  another 
package,  useless  with  statements,  etc.  As 
part  of  our  complexity  reducing  algorithms 
for  the  second  generation  analysis 
technology,  we  will  be  exploring  algorithms 
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of  a  35,000  line  subset  of  a  large  military 
command  and  control  system. 

Typically,  if  we  see  a  reasonably  organized 
call  graph  such  as  this,  the  inter-package 
data  reference  graph  also  is  relatively 
orderly.  This  is  not  surprising,  because 
most  detailed  design  methods  emphasize 
reduced  data  coupling  between  modules  and 
some  form  of  successive  refinement  of 
designs  that  tends  to  produce  relatively 


orderly hierarchical call graphs. In
contrast, Figure 7 shows the inter-package
type  reference  graph  of  the  same  program.  It 
is  much  less  regular,  and  more  complex.  It 
also  has  many  more  packages  in  it,  because 
many  packages  contain  types  but  no  call 
references. This result is true for all
programs we have analyzed so far: the type
reference graph is much more
complex  than  the  data  or  call  reference 
graphs. The only exceptions are a few


Fig. 6 Package to Package Call Graph


programs  whose  call  and  data  reference 
graphs  are  unusually  complex  and  the  type 
reference  graph  is  equally  so,  suggesting 
poor  design  or  implementation  problems. 

We  can  not  yet  suggest  movement  of  type 
definitions  from  package  to  package  to 
determine  how  much  of  this  additional 
complexity  is  superficial  and  what  is 


essential.  However,  these  initial 
observations  suggest  that,  while  modern 
methods  can  control  the  complexity  of  data 
and  call  references,  they  provide  much  less 
control  of  the  type  definition  and  reference 
structure  complexity. 
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Summary 

In  summary,  we  believe  we  have  developed 
an  innovative  technique  to  assess  the 
complexity  of  large  software  systems  and  to 
help  reduce  or  manage  that  complexity.  The 
pattern  analysis  technique  not  only  helps 
identify  what  is,  but  also  can  help  guide 


modifications  toward  what  ought  to  be. 
Although  the  technique’s  true  strengths  and 
weaknesses  are  still  unknown,  we  are 
convinced  that  there  are  many  effective 
applications  for  managing  complexity 
growth  or  reducing  the  complexity  in 
software  systems. 


Fig. 7 Package to Package Type Reference Graph

evaluating  the  relative  efficacy  of  proposed 
software  development  methods  by 
comparing  the  relative  complexity  of  their 
products.  This  application  could  be  a  major 
step  toward  engineering  new  and  improved 
software  processes. 


Our  focus  for  now  will  remain  on  assessing 
the  quality  of  systems  and  helping  guide 
complexity  reducing  efforts.  However,  we 
expect  that  more  advanced  applications  of  the 
technique  should  permit  us  to  predict  the 
probable  complexity  of  solutions  while  still 
in  the  design  stage.  This  should  help  us 
estimate  the  life  cycle  cost  of  software  and 
should  be  a  major  step  toward  managing  the 
life cycle cost of software, rather than just
trying  to  control  it.  Even  more  attractive,  but 
at  the  horizon  for  now,  is  the  potential  for 
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Methodology  for  Validating  Software  Metrics 
Norman F. Schneidewind
Code  AS/Ss 
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Abstract  Monterey,  CA  93943 

We  propose  a  comprehensive  metrics  validation  methodology  that  has 
six  validity  criteria,  which  support  the  quality  functions  assessment, 
control  and  prediction,  where  quality  functions  are  activities  conducted 
by  software  organizations  for  the  purpose  of  achieving  project  quality 
goals.  Six  criteria  are  defined  and  illustrated:  association, 
consistency,  discriminative  power,  tracking,  predictability  and 
repeatability.  We  show  that  non-parametric  statistical  methods  like 
contingency  tables  play  an  important  role  in  evaluating  metrics  against 
the  validity  criteria.  Examples  emphasizing  the  discriminative  power 
validity  criterion  are  presented.  A  metrics  validation  process  is 
defined  that  integrates  quality  factors,  metrics  and  quality  functions. 

Index  Terms:  metrics  validation  methodology,  metrics  validation  process, 
non-parametric  statistical  methods,  quality  functions,  validity 
criteria . 


INTRODUCTION 

We  believe  that  software  metrics  should  be  treated  as  part  of  an 
engineering  discipline:  metrics  should  be  evaluated  (validated)  to 
determine  whether  they  measure  what  they  purport  to  measure  prior  to 
using  the  metrics.  Furthermore,  if  metrics  are  to  be  of  greatest 
utility,  the  validation  should  be  performed  in  terms  of  the  quality 
functions  (quality  assessment,  control  and  prediction)  that  the  metrics 
are  to  support. 

We propose and illustrate a validation methodology whose adoption, we
believe,  would  provide  a  rational  basis  for  using  metrics.  This  is  a 
comprehensive  metrics  methodology  that  builds  on  the  work  of  others. 
These  have  been  validation  analyses  performed  on  specific  metrics  or 
metric  systems  for  the  purpose  of  satisfying  specific  research  goals. 
Among  these  validations  are  the  following:  1)  function  points  as  a 
predictor  of  work  hours  across  different  development  sites  and  sets  of 
data [1]; 2) reliability of metrics data reported by programmers [3]; 3)
Halstead  operator  count  for  Pascal  programs  [10];  4)  metric-based 
classification  trees  [16];  5)  evaluation  of  metrics  against  syntactic 
complexity  properties  [17]. 

Our approach to validation has the following characteristics: 1) The
methodology  is  general  and  not  specific  to  particular  metrics  or 
research  objectives.  2)  It  is  developed  from  the  point  of  view  of  the 
metric  user  (rather  than  the  researcher),  who  has  requirements  for 
assessing,  controlling  and  predicting  quality.  To  illustrate  the 
difference  in  viewpoint,  we  can  make  an  analogy  with  the  automobile 
industry:  the  manufacturer  has  an  interest  in  brake  lining  thickness,  as 
it  relates  to  stopping  distance,  but  from  the  driver's  perspective,  the 
only  meaningful  metric  is  stopping  distance!  3)  It  consists  of  six 
mathematically  defined  criteria,  each  of  which  is  keyed  to  a  quality 
function,  so  the  user  of  metrics  can  understand  how  a  characteristic  of 
a  metric,  as  revealed  by  validation  tests,  can  be  applied  to  measure 
software  quality.  4)  The  six  criteria  are:  association,  consistency, 
discriminative  power,  tracking,  predictability  and  repeatability.  5)  It 
recognizes that a given metric can have multiple uses (e.g., assess,
control and predict quality) and that a given metric can be valid for
one  use  and  invalid  for  another  use.  6)  It  defines  a  metrics  validation 
process  that  integrates  quality  factors,  metrics  and  functions. 

The  paper  is  organized  as  follows:  First,  a  framework  is  established 
which  pulls  together  the  concepts  and  definitions  of  quality  factor, 
quality  metric,  validated  metric,  quality  function,  validity  criteria, 
and  a  metrics  validation  process.  These  concepts  and  definitions  are 
integrated  by  the  use  of  a  metrics  validation  process  chart.  In  this 
section  we  show  how  validity  criteria  support  quality  functions.  Next, 
we  indicate  why  non-parametric  statistical  methods  are  applicable  to  and 
compatible  with  the  validity  criteria.  This  is  followed  by  an  example  of 
metrics  validation,  using  the  discriminative  power  validity  criterion. 
Lastly, some comments are made about future research directions.

FRAMEWORK 

The  framework  of  our  metrics  methodology  consists  of  the  following 
elements,  which  are  keyed  to  Figure  1:  quality  factor,  quality  metric, 
validated  metric,  quality  functions,  validity  criteria,  and  metrics 
validation process. In Figure 1, we use the notation [Project, Time,
Measurement] to designate the project, time (e.g., life cycle phase) and
type  of  measurement  (quality  factor,  quality  metric).  We  use  V  to 
designate  the  project  in  which  a  metric  is  validated  and  A  to  designate 
the  project  in  which  the  metric  is  applied. 

This  diagram  is  interpreted  as  follows: 

o  The events and time progression of the validation project are

depicted  by  the  top  horizontal  line  and  arrow.  This  time  line 
consists  of  Project  1  with  metric  M  collection  in  Phase  T1  (step  1); 
factor  F  collection  in  Phase  T2  (step  2);  and  validation  of  M  with 
respect  to  F  in  Phase  T2  (step  3). 

o  The  events  and  time  progression  of  the  application  project  are 

depicted  by  the  bottom  horizontal  line  and  arrow.  This  project  is 
later  in  chronological  time  than  the  validation  project  but  has  the 
same  phases  T1  and  T2.  This  time  line  consists  of  Project  2  with 
metric collection M' in Phase T1 (step 4); application of M' to
assess, control, and predict quality in Phase T1 (step 5); collection
of factor F' in Phase T2 (step 6); and revalidation of M and M' with
respect  to  F  and  F'  in  Phase  T2  (step  7). 

o  Metric  M'  is  the  same  metric  as  M  but,  in  general,  it  has  different 

values  since  it  is  collected  in  a  different  project.  The  same 
statement  applies  to  F'  and  F. 

Each  element  is  defined  and  described  in  greater  detail  in  the 
following  sections. 

Quality  Factor 


A  quality  factor  F  (hereafter  referred  to  as  "factor"  or  "F")  is  an 
attribute  of  software  that  contributes  to  its  quality  [13],  where 
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software  quality  is  defined  as  the  degree  to  which  software  possesses  a 
desired combination of attributes [14]. For example, reliability (an
attribute  that  contributes  to  quality)  is  a  factor.  A  factor  can  have 
values,  such  as  the  error  counts  in  a  set  of  software 
components  (i.e.,  an  element  of  a  software  system,  such  as  module,  unit, 
data  or  document  [13]).  We  define  F  to  be  a  type  of  metric  that  provides 
a  direct  measure  of  software  quality  [6].  This  means  that  F  is  an 
intrinsic  indicator  of  quality  as  perceived  by  the  user,  such  as  errors 
in  the  software  that  result  in  failures  during  operation.  We  denote  F  as 
the  factor  in  V  and  F'  as  the  factor  in  A.  F  and  F'  are  shown  as 
collected  at  point  2  and  at  point  6,  respectively,  in  Figure  1. 

Quality  Metric 


A quality metric M (hereafter called "metric" or "M") is a function
(e.g.,  cyclomatic  complexity  M  =  e  -  n  +  2p)  whose  inputs  are  software 
data  (elementary  software  measurements,  such  as  number  of  edges  e  and 
number  of  nodes  n  in  a  directed  graph)  and  whose  output  is  a  single 
numerical  value  M  that  can  be  interpreted  as  the  degree  to  which 
software  possesses  a  given  attribute  (cyclomatic  complexity)  that  may 
affect  its  quality  (e.g.,  reliability)  [15].  For  example,  if  there  are 
two components 1 and 2 with M1 = 3 and M2 = 10, this may indicate that
the  reliability  of  1  may  be  greater  than  the  reliability  of  2.  Whether 
this  is  the  case  depends  upon  whether  M  is  a  valid  metric  (see  below). 
We  define  M  to  be  an  indirect  measure  of  software  quality  [2,6].  This 
means  that  M  may  be  used  as  a  substitute  for  F,  when  F  is  not  available, 
as  is  the  case  during  the  design  phase.  M  is  shown  as  collected  at  point 
1  in  Figure  1. 
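
As a small worked instance of the metric function above, the sketch below evaluates M = e - n + 2p in Python. It assumes p is the number of connected components of the control flow graph (the usual reading of the cyclomatic complexity formula; the text does not define p), and the edge and node counts are hypothetical values chosen to reproduce M1 = 3 and M2 = 10.

```python
def cyclomatic_complexity(edges: int, nodes: int, components: int = 1) -> int:
    """Evaluate M = e - n + 2p for a directed graph.

    `components` stands for p, assumed here to be the number of
    connected components of the graph.
    """
    return edges - nodes + 2 * components


# Hypothetical graphs chosen so that the results match the example values.
m1 = cyclomatic_complexity(edges=4, nodes=3)    # 4 - 3 + 2
m2 = cyclomatic_complexity(edges=20, nodes=12)  # 20 - 12 + 2
print(m1, m2)
```

Whether the lower value of M for component 1 actually signals higher reliability is exactly what validation is meant to establish.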

It  is  important  to  recognize  that,  in  general,  there  can  be  a  many- 
to-many  relationship  between  F  and  M.  For  expository  purposes  we  limit 
our  examples  to  one-to-one  or  one  (F)  to  many  (M)  relationships. 

Validated  Metric 


A  validated  metric  is  one  whose  values  have  been  shown  to  be 
statistically associated with corresponding factor values (e.g., M1,
..., Mn have been statistically associated with F1, ..., Fn for a set of
software components 1, ..., n) [13]. A validation test of M with respect
to  F  is  shown  at  point  3  in  Figure  1.  We  denote  M'  as  a  validated 
metric.  Since  M  is  validated  with  respect  to  F,  it  is  necessarily  the 
case  that  F  is  valid.  Therefore  we  say  that  F  is  valid  by  definition,  as 
a  result  of  wide  acceptance  or  historical  usage  (e.g.,  error  count). 

Since  F  is  a  direct  measure  of  quality,  it  is  preferred  to  M  whenever 
it  is  possible  to  measure  F  sufficiently  early  in  the  life  cycle  to 
permit quality to be assessed, controlled and predicted (see below).
However, since this is usually not the case, the need for validation
arises. We also note that since the cost of finding and correcting
errors grows rapidly with the life cycle, it is advantageous to have
approximate early (leading) indicators of software quality.
(Analogously,  one  could  posit  that  the  Dow  Jones  stock  price  average  (M) 
is  an  approximate  leading  indicator  of  Gross  National  Product  (F)  in  the 
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VALIDATION  (V)  PROJECT  1  [Pl]  TIME  LINE 


PHASE T1:  1. Collect M                                    V[P1, T1, M]
PHASE T2:  2. Collect F;  3. Validate M with F             V[P1, T2, F]


APPLICATION (A) PROJECT 2 [P2] TIME LINE

PHASE T1:  4. Collect M';  5. Apply M' to: Assess,
           Control, Predict                                A[P2, T1, M']
PHASE T2:  6. Collect F';  7. Revalidate M, M' with F, F'  A[P2, T2, F']


Figure 1. Metrics Validation Process
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American  economy  and  conduct  a  validation  test  between  the  two).  Thus, 
we  can  formulate  the  following  policy  with  respect  to  software 
measurement:  When  it  is  feasible  to  measure  and  apply  F,  use  it; 
otherwise,  attempt  to  validate  M  with  respect  to  F  and,  if  successful, 
use M'.

Quality  Functions 


Quality  functions  are  activities  conducted  by  software  organizations 
for  the  purpose  of  achieving  project  quality  goals.  Both  product  and 
process  goals  are  included.  The  quality  functions  that  are  pertinent  to 
this  metrics  methodology  are:  assessment,  control  and  prediction. 

Quality  Assessment 

Quality  assessment  is  the  evaluation  of  the  relative  quality  of 
software  components.  "Relative  quality"  is  the  quality  of  a  given 
component  compared  with  the  quality  of  other  components  in  the  set 
(e.g.,  if  M'  is  cyclomatic  complexity,  the  quality  of  component  1,  with 
M'  =  3,  may  be  better  than  the  quality  of  component  2,  with  M'  =  10). 
Validated  metrics  are  used  to  make  a  relative  comparison  of  the  quality 
of  software  components.  The  purpose  of  assessment  is  to  provide  software 
managers  with  a  rational  basis  for  assigning  priorities  for  quality 
improvement  and  for  allocating  personnel  and  computer  resources  to 
quality  assurance  functions.  For  example,  priorities  and  resources  would 
be  assigned  on  the  basis  of  relative  values  (or  ranks)  of  M'  (i.e.,  the 
most  resources  would  be  assigned  to  the  components  with  the  highest 
(lowest) values (or ranks) of M'). M' is shown collected at point 4 in
Figure  1  and  used  for  assessment  at  point  5. 

Quality  Control 

Quality  control  is  the  evaluation  of  software  components  against 
predetermined  critical  values  of  metrics  (i.e.,  value  of  M'  which  is 
used  to  identify  software  which  has  unacceptable  quality  [13])  and  the 
identification  of  components  that  fall  outside  quality  limits.  We  denote 
M'c as the critical value of M'. Validated metrics are used to identify
components  with  unacceptable  quality.  The  purpose  of  control  is  to  allow 
software  managers  to  identify  software  that  has  unacceptable  quality 
sufficiently  early  in  the  development  process  to  take  corrective  action. 
For example, M'c = 3 would be used as a critical value of cyclomatic
complexity  to  discriminate  between  components  that  contain  errors  and 
those  that  do  not. 

Control  also  involves  the  tracking  of  the  quality  of  a  component  over 
its  life  cycle.  For  example,  if  M'  is  cyclomatic  complexity,  an  increase 
from  3  to  10,  as  the  result  of  a  design  change,  would  be  used  to 
indicate  possible  degradation  in  quality.  M'  is  shown  as  collected  at 
point 4 in Figure 1 and used for control at point 5.

Quality  Prediction 

Quality  prediction  is  a  forecast  of  the  value  of  F  at  time  T2  based 
on the values of M'1, M'2, ..., M'n for components 1, 2, ..., n at time
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Tl,  where  "time"  could  be  computer  execution  time,  labor  time  or 
calendar  time.  Validated  metrics  (e.g.,  size,  complexity)  are  used 
during  the  design  phase  to  make  predictions  of  test  or  operational  phase 
factors  (e.g.,  error  count).  The  purpose  of  prediction  is  to  provide 
software  managers  with  a  forecast  of  the  quality  of  the  operational 
software  and  to  flag  components  for  detailed  inspection  whose  predicted 
factor  values  are  greater  than  (or  less  than)  the  target  values 
(determined  from  requirements  analysis).  M'  is  shown  as  collected  at 
point  4  in  Figure  1  and  used  for  prediction  at  point  5. 

Validity  Criteria 


Validity criteria provide the rationale for validating metrics; they
are the specific quantitative relationships that are hypothesized to
exist between factors and metrics. Validity criteria, in turn, are based
on the principle of validity, which defines the general quantitative
relationship  between  factors  and  metrics  that  must  exist  for  the 
validity  criteria  to  be  applied.  First  we  provide  definitions  relating 
to  the  principle  of  validity.  Then  we  define  the  principle  of  validity. 
Last  we  define  each  validity  criterion  and  provide  an  example  of  its 
application. 

Definitions: 


R[M]:   Relation R on vector M   for V[P1, T1, M]     (1)

R[F]:   Relation R on vector F   for V[P1, T2, F]     (2)

R[M']:  Relation R on vector M'  for A[P2, T1, M']    (3)

R[F']:  Relation R on vector F'  for A[P2, T2, F']    (4)

where  R  could  be,  for  example,  an  order  relation  like: 

Magnitude[M1 < M2 ... < Mn] and Magnitude[F1 < F2 ... < Fn] involving n values
(data  points)  for  M  and  F. 

Principle  of  Validity: 

IF R[M] <=> R[F]

is validated statistically with confidence level $\alpha$ and, for certain
validity criteria, with threshold value $\beta_i$,

THEN {R[M] <=> R[F]} => {R[M'] => R[F']}? (5)

In  other  words,  does  the  mapping  M  <=>  F,  validated  on  Project  1, 
imply  a  mapping  M'  =>  F'  on  Project  2?  We  assume  (5)  to  be  true  at 
point  5  in  Figure  1.  Once  F'  is  collected  at  point  6,  we  revalidate  (or 
invalidate) (5) by repeating the validation test using aggregated M and
M'  validated  with  respect  to  aggregated  F  and  F'  at  point  7. 

We  note  that  a  metric  may  be  valid  with  respect  to  certain  validity 
criteria  and  invalid  with  respect  to  other  criteria.  Each  validity 
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criterion  supports  one  or  more  of  the  quality  functions  assessment, 
control and prediction, which were described above. The validity
criteria -- association, consistency, discriminative power, tracking,
predictability and repeatability -- are applied at point 3 of Figure 1.
The particular criteria that are used depend on the quality functions
(one or more) that are to be supported.

The validation procedure requires that threshold values $\beta_i$ be
selected for certain validity criteria. The criterion used for selecting
these values is reasonableness (i.e., judgement must be exercised in
selecting values to strike a balance between the one extreme of causing
an M, which has a high degree of association with F, to fail validation
and the other extreme of allowing an M of questionable validity to pass
validation).

A short simple numerical example follows the definition of each
validity criterion for the purpose of illustrating the basic concepts of
the validity criteria. For illustrative purposes, F is error count and
M is cyclomatic complexity, or complexity for short, in the examples.
Also, to keep the examples simple, we use small sample sizes; these
sample sizes would not be acceptable in practice. As noted previously,
given {F} and {M}, it is possible to have an M in {M} predict multiple
Fs in {F} or to have an F in {F} predicted by multiple Ms in {M}.
However, in order to simplify the examples, only the one-to-one case
will be illustrated.

Association: 

The variation in F explained by the variation in M, which is given by
$R^2$ (coefficient of determination), where R is the linear correlation
coefficient, must exceed a specified threshold, or

$R^2 > \beta_a$, with specified $\alpha$. (6)

This criterion assesses whether there is a sufficient linear
association between F and M to warrant using M as an indirect measure of
F. This criterion supports the quality assessment function as follows:

If the elements of vector M, corresponding to components 1, 2, ..., n,
are ordered by magnitude, as illustrated in Table 1, can we infer a
linear ordering of F with respect to M for the purpose of assessing
differences in component quality? In other words does the following
hold?


$\mathrm{Magnitude}[M_1 < M_2 \dots < M_i \dots < M_n] \Leftrightarrow \mathrm{Magnitude}[F_1 < F_2 \dots < F_i \dots < F_n]$ (7)

and $(M_{i+1} - M_i) \propto (F_{i+1} - F_i)$ for $i = 1, 2, \dots, n-1$.


The data of Table 1 are plotted in Figure 2 to contrast perfect with
imperfect association.
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Table  1 

(Validation Project)

Component   M (Magnitude)   M (Rank)   F (Magnitude)   F (Rank)

    1             8            1             2            1
    2            10            2             6            2
    3            11            3             8            4
    4            14            4             7            3

Since there is seldom perfect linear magnitude ordering between F and
M (i.e., R = 1.0), we use (6) to measure the degree to which (7) holds.
For example, if R = .9 and $\alpha$ = .05, then 81% of the variation in F
(error count) is explained by the variation in M (complexity), with an
acceptable confidence level. If this relationship is demonstrated over
a representative sample of components, and if $\beta_a$ has been established as
.7, we could conclude that M is associated with F and can be used to
compare magnitudes of complexity obtained from different components to
assess the degree to which they differ in quality (e.g., the difference
in complexity magnitude between component 2 and component 1 (10 - 8) is
proportional to their difference in quality in Table 1).

The  resultant  M'  would  be  used  to  assess  differences  in  the  quality 
of  components  on  the  application  project. 
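
The association criterion can be tried on the Table 1 data. The sketch below, in Python, computes the linear correlation coefficient R and the coefficient of determination for the four components and compares the latter with a threshold of .7 (the value used in the example above). It is an illustration only: the helper name `pearson_r` is ours, and the significance test at the specified confidence level is not performed.

```python
from math import sqrt


def pearson_r(x, y):
    """Linear (Pearson) correlation coefficient of two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


M = [8, 10, 11, 14]  # complexity magnitudes from Table 1
F = [2, 6, 8, 7]     # error counts from Table 1
BETA_A = 0.7         # association threshold used in the example above

r = pearson_r(M, F)
r_squared = r * r
passes_association = r_squared > BETA_A
print(round(r, 3), round(r_squared, 3), passes_association)
```

For these four data points the coefficient of determination falls below the threshold, so M would not pass the association criterion on this (deliberately tiny) sample.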

Consistency: 

The rank correlation coefficient r between F and M must exceed a
specified threshold, or

$r > \beta_c$, with specified $\alpha$. (8)

This  criterion  assesses  whether  there  is  sufficient  consistency 
between  the  ranks  of  F  and  the  ranks  of  M  to  warrant  using  M  as  an 
indirect  measure  of  F  [9].  This  criterion  supports  the  quality 
assessment  function  as  follows: 


If  the  elements  of  vector  M,  corresponding  to  components  1,2,  ...,n, 
are  ordered  by  rank,  as  illustrated  in  Table  1,  can  we  infer  an  ordering 
of  F  with  respect  to  M  for  the  purpose  of  assessing  the  rank  order  of 
component  quality?  In  other  words  does  the  following  hold? 


$\mathrm{Rank}[M_1 < M_2 \dots < M_i \dots < M_n] \Leftrightarrow \mathrm{Rank}[F_1 < F_2 \dots < F_i \dots < F_n]$ (9)

The data of Table 1 are plotted in Figure 3 to contrast perfect with
imperfect consistency for the same set of components.


Since there is seldom perfect rank ordering between F and M (i.e., r
= 1.0), we use (8) to measure the degree to which (9) holds. For
example, if r = .8 and $\alpha$ = .05, there is a rank correlation of 80%
between F and M, with an acceptable confidence level. If this
relationship is demonstrated over a representative sample of components,
and if $\beta_c$ has been established as .7, we could conclude that M is
consistent with F and can be used to compare ranks of complexity
obtained from different components to assess the degree to which they
differ in relative quality (e.g., component 2 quality is lower (higher
complexity) than component 1 quality in Table 1).

The resultant M' would be used to assess relative quality of
components  on  the  application  project. 
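
The ranks in Table 1 give a quick numerical check of the consistency criterion. The sketch below assumes the rank correlation coefficient is Spearman's coefficient, computed with the standard formula for untied ranks; the function name `spearman_r` is ours, and no significance test is included.

```python
def spearman_r(rank_x, rank_y):
    """Spearman rank correlation for two rankings without ties."""
    n = len(rank_x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))


M_RANK = [1, 2, 3, 4]  # ranks of M from Table 1
F_RANK = [1, 2, 4, 3]  # ranks of F from Table 1
BETA_C = 0.7           # consistency threshold used in the example above

r_rank = spearman_r(M_RANK, F_RANK)
passes_consistency = r_rank > BETA_C
print(round(r_rank, 3), passes_consistency)
```

Only components 3 and 4 swap ranks, so the coefficient works out to .8, which exceeds the threshold of .7 used in the example.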

Discriminative  Power: 

The critical value of a metric $M_c$ must be able to discriminate, for
a specified $F_c$, between elements (components 1, 2, ..., i, ..., n) of vector
F [17], in the following way:

$M_i > M_c \Leftrightarrow F_i > F_c$ and (10)
$M_i \le M_c \Leftrightarrow F_i \le F_c$

for $i = 1, 2, \dots, n$, with specified $\alpha$.

This criterion assesses whether $M_c$ has sufficient discriminative
power to warrant using it as an indirect measure of F. This criterion
supports the quality control function as follows:

Would $M_c$, as illustrated in Table 2, partition F, for a specified $F_c$,
as defined in (10)? For example, the data from Table 1 are used in Table
2, with $M_c$ = 10 and $F_c$ = 2. We see that discriminative power is not
perfect in Table 2 (i.e., $O_{21} \neq 0$). If it is desired to flag components
with more than two errors (F > $F_c$) for detailed inspection, and if $M'_c$
= 10 (complexity) is validated, it would be used on the application
project to control quality (i.e., discriminate between acceptable and
unacceptable components), as shown in Figure 4. One purpose of Figure 4
is to identify trends in quality (e.g., a persistent case of components
being in the unacceptable zone).


Table  2 

(Validation  Project) 


M_c = 10, F_c = 2     M <= M_c     M > M_c

F <= F_c              O11 = 1      O12 = 0
F > F_c               O21 = 1      O22 = 2

Oij = count of observations in cell i,j.
O11, O22: correct classifications.
O12, O21: incorrect classifications.


[Figure 4 plots the metric values of the components against application
project time; components above the critical value M'c = 10 are
unacceptable, and those at or below it are acceptable.]

Figure 4. Application of Metrics to Quality Control (discriminative
power) for Components 1, 2, ..., n

Since there is seldom a perfect discriminator for $F_c$ (i.e., $O_{12} =
O_{21} = 0$ in Table 2), we use an appropriate statistical method (e.g.,
chi-square  contingency  table  [7,8,12])  and  representative  sample  of 
components  to  measure  the  degree  to  which  (10)  holds. 
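
The contingency table and its chi-square statistic can be reproduced from the Table 1 data. The sketch below is an illustration in Python under two assumptions: the cells are split at M <= M_c and F <= F_c, which reproduces the counts shown in Table 2, and the statistic is the ordinary Pearson chi-square without continuity correction. The function names are ours. The critical value 3.841 is the standard tabulated chi-square value for one degree of freedom at the .05 level.

```python
def contingency_table(metric, factor, m_c, f_c):
    """2x2 counts: rows are F <= f_c / F > f_c, columns are M <= m_c / M > m_c."""
    table = [[0, 0], [0, 0]]
    for m, f in zip(metric, factor):
        row = 0 if f <= f_c else 1
        col = 0 if m <= m_c else 1
        table[row][col] += 1
    return table


def chi_square(table):
    """Pearson chi-square statistic for a 2x2 table (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat


M = [8, 10, 11, 14]   # complexity, Table 1
F = [2, 6, 8, 7]      # error count, Table 1
CHI2_CRIT_05 = 3.841  # tabulated value, 1 degree of freedom, .05 level

table = contingency_table(M, F, m_c=10, f_c=2)
chi2 = chi_square(table)
significant = chi2 > CHI2_CRIT_05
print(table, round(chi2, 3), significant)
```

With only four components the expected cell counts are far too small for the chi-square approximation to be trusted, which is consistent with the remark that these sample sizes would not be acceptable in practice.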


Tracking:

M must change in unison with F, for a given component i, at times
$T_1, T_2, \dots, T_j, T_{j+1}, \dots, T_m$, as follows:

$M_i(T_{j+1}) > M_i(T_j) \Leftrightarrow F_i(T_{j+1}) > F_i(T_j)$ and (11)
$M_i(T_{j+1}) = M_i(T_j) \Leftrightarrow F_i(T_{j+1}) = F_i(T_j)$ and
$M_i(T_{j+1}) < M_i(T_j) \Leftrightarrow F_i(T_{j+1}) < F_i(T_j)$

with specified $\alpha$.

This  criterion  is  illustrated  graphically  in  Figure  5  to  contrast 
perfect with imperfect tracking, where factor and metric values are
plotted  against  project  time. 

This  criterion  assesses  whether  M  is  capable  of  tracking  changes  in 
F  (e.g.,  as  a  result  of  design  changes)  to  a  sufficient  degree  to 
warrant  using  M  as  an  indirect  measure  of  F.  This  criterion  supports  the 
quality  control  function  as  follows: 

Would changes in M track changes in F as defined in (11)? If M is
validated, then a vector M'i(Tj) consisting of the values
M'i(T1), M'i(T2), ..., M'i(Tj), ..., M'i(Tm) of component i, measured at
times T1, T2, ..., Tj, ..., Tm, would be used to track quality on the
application project. For example, if complexity M'i is valid for
tracking error count F, M'i would be used as shown in Figure 6, where
quality increases from T1 to T2, stays the same from T2 to T3, and
decreases thereafter.


[Figure 5 plots factor and metric values for component i against project
time, contrasting perfect with imperfect tracking.]

Figure 5. Tracking Validity Criterion (Component i)


[Figure 6 plots the values of M'i against application project time.]

Figure 6. Application of Metrics to Quality Control (tracking)
for Component i at Times 1, 2, ..., m

Since  there  is  seldom  perfect  tracking  of  F  by  M,  we  use  an 
appropriate  statistical  method  (e.g.,  binary  sequences  test  [8])  and 
representative  sample  for  component  i  to  measure  the  degree  to  which 
(11)  holds. 
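
Before applying a formal test, the direction-of-change comparison in (11) can be tabulated directly. The sketch below is a simple sign-agreement count in Python, offered as an illustration only; it is not the binary sequences test cited above, and the two time series are hypothetical values for a single component.

```python
def direction(previous, current):
    """Return +1, 0 or -1 for an increase, no change or a decrease."""
    return (current > previous) - (current < previous)


def tracking_agreement(m_series, f_series):
    """Fraction of consecutive time steps where M and F move in the same direction."""
    m_steps = zip(m_series, m_series[1:])
    f_steps = zip(f_series, f_series[1:])
    matches = [direction(*m) == direction(*f) for m, f in zip(m_steps, f_steps)]
    return sum(matches) / len(matches)


# Hypothetical measurements for one component at times T1..T5.
m_i = [3, 5, 5, 10, 8]  # complexity
f_i = [1, 2, 2, 2, 1]   # error count

agreement = tracking_agreement(m_i, f_i)
print(agreement)
```

Here M and F agree in direction on three of the four steps; the step where complexity rises but the error count does not is the kind of imperfect tracking the criterion is meant to expose.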


Predictability: 


A function of M, f(M), where M is measured at time T1, must predict
F, measured at time T2, with an accuracy $\beta_p$, or

$$
\left| \frac{Fa_{T2} - Fp_{T2}}{Fa_{T2}} \right| < \beta_p \tag{12}
$$

where $Fa_{T2}$ is the actual value and $Fp_{T2}$ is the predicted value.


This  criterion  is  illustrated  graphically  in  Figure  7  to  contrast 
perfect with imperfect prediction, where f(M), formulated at T1, will
either turn out to be equal to Fa at T2 (perfect Predictability), or be
equal to Fp+ or Fp- (imperfect Predictability).


[Figure 7. Application of Metrics to Quality Prediction (Predictability) for a Component: F plotted against application project time, with f(M) formulated at T1 and the outcomes at T2 labeled Fp+ (imperfect Predictability), Fa (perfect Predictability), and Fp- (imperfect Predictability).]

This  criterion  assesses  whether  f(M)  can  predict  F  with  required 
accuracy.  This  criterion  supports  the  quality  prediction  function  as 
follows:

If (12) holds, would the following hold?

$$
Fp_{T2} = f(M_{T1}) \implies Fp'_{T2} = f(M'_{T1}) \tag{13}
$$

where vector $Fa_{T2} = [F_1, \ldots, F_n]_{aT2}$ and vector $M_{T1} = [M_1, \ldots, M_n]_{T1}$ for components 1,2,...,n, and $Fp'_{T2}$ and $M'_{T1}$ are similarly defined. In other words, do we have:

$$
\left| \frac{Fa'_{T2} - Fp'_{T2}}{Fa'_{T2}} \right| < \beta_p \tag{14}
$$


For example, if a function f relating error count to complexity
can be identified (e.g., by regression analysis) that is a good predictor
of F (i.e., satisfies (12)), then we would use the same f as the
predictor of F' to predict error count from complexity on the
application project.

Since there is seldom a perfect f (i.e., one for which $Fp_{T2} = Fa_{T2}$), we use (12)
to measure the degree to which f predicts F.
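
The check in (12) can be sketched as follows. The code computes the relative prediction error for each component and tests it against a threshold $\beta_p$. The linear predictor, the sample values, and the decision to require every component to pass are assumptions made for illustration only; in practice f would come from a fitted model such as a regression.

```python
def relative_error(actual, predicted):
    """|Fa - Fp| / |Fa| for a single component; Fa must be nonzero."""
    if actual == 0:
        raise ValueError("actual factor value must be nonzero")
    return abs(actual - predicted) / abs(actual)


def satisfies_predictability(actuals, predictions, beta_p):
    """True when every component's relative error is below beta_p."""
    return all(relative_error(a, p) < beta_p
               for a, p in zip(actuals, predictions))


def predictor(complexity):
    # Hypothetical f(M): a linear relation standing in for a fitted regression.
    return 0.5 * complexity


# Hypothetical validation data: complexity at T1, actual error count at T2.
complexity_t1 = [4, 8, 10]
errors_t2 = [2, 5, 5]
predicted_t2 = [predictor(c) for c in complexity_t1]
errors_rel = [relative_error(a, p) for a, p in zip(errors_t2, predicted_t2)]
```

Calling `satisfies_predictability(errors_t2, predicted_t2, beta_p)` with the threshold chosen for the project then answers whether f meets the criterion on the validation data.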


Repeatability:

The success rate of validating M for a given validity criterion i
must satisfy:

$$
N_{iv} / N_i > \beta_i \tag{15}
$$

where $N_{iv}$ is the number of validations of M for criterion i and $N_i$ is
the total number of trials for criterion i.


This  criterion  assesses  whether  M  can  be  validated  on  a  sufficient 
percentage  of  trials  to  have  confidence  that  it  would  be  a  dependable 
indicator  of  quality  in  the  long  run.  We  use  "trials"  because  validation 
could  be  performed  with  respect  to  projects,  applications,  components, 
or  some  other  appropriate  entity. 

Metrics  Validation  Process 


Given  that  there  must  be  a  validation  project  V  and  an  application 
project  A,  as  shown  in  Figure  1,  this  requirement  gives  rise  to  what  we 
call  the  "fundamental  problem  in  metrics  validation".  This  problem 
arises  because  there  could  be  significant  time  lags,  product  and  process 
differences,  and  differences  in  goals  and  environments  [5]  between  the 
following  phases  of  the  validation  process  (see  Figure  1): 

1) V[P1, T1, M] and V[P1, T2, F]

2) V[P1, T2, F] and A[P2, T1, M']

3) A[P2, T1, M'] and A[P2, T2, F'].

An important characteristic of the methodology is expressed by the
following:

IF V[P1, T1, M] <==> V[P1, T2, F] (16)

THEN A[P2, T1, M'] => A[P2, T2, F'].

From  (16)  it  follows  that  at  point  3  in  Figure  1,  M  is  validated  in 
V. Whether M' will actually be valid in A will not be known until point
7.  Thus  it  is  worthwhile  to  discuss  some  of  the  practical  difficulties 
of  adhering  to  (16)  and  possible  remedies. 


With  respect  to  1),  the  product  or  process  may  have  changed  so  much 
between  T1  and  T2  that  M  collected  at  T1  may  no  longer  be  representative 
of  F.  If  this  is  the  case,  M  should  be  collected  again  at  T2  to  validate 
against  F.  The  advantage  of  collecting  M  at  T1  is  that  it  may  be  easier 
and  less  expensive  than  at  T2  because  M  can  be  collected  as  a  by-product 
of  compilation  and  design  and  code  inspections. 

The  same  considerations  apply  with  respect  to  3)  except  now  the 
concern  is  with  whether  M'  collected  at  T1  should  be  used  for 
revalidation at T2. However, note that it is mandatory that M' be
collected  at  T1  to  have  an  early  indication  of  possible  quality  problems 
(that  is  a  key  concept  of  our  methodology!). 

With  respect  to  2),  we  can  achieve  a  degree  of  stability  in  the 
validation  process  if  the  following  procedure  is  employed: 

a)  Select  V  and  A  to  be  as  similar  as  possible  with  respect  to 
application  and  development  environments. 

With respect to 1), 2) and 3) considered jointly, we can achieve a
degree  of  stability  in  the  validation  process  if  a)  is  employed  plus  the 
following  two  additional  procedures: 

b)  Select  the  same  life  cycle  phase  for  T1  in  V  and  A. 

c)  Select  the  same  life  cycle  phase  for  T2  in  V  and  A. 

We recognize that it may be infeasible to implement a), b) and c). If
this is the case, it means there is a higher risk that M validated at
point 3 in Figure 1 will not remain valid at point 5.

NON-PARAMETRIC  STATISTICAL  METHODS  FOR  METRICS  VALIDATION 

Non-parametric statistical methods are used to support metrics
validation because these methods have important advantages over
parametric methods. Indeed it would be infeasible to validate metrics in
many situations without their use. This is the case because the
assumptions that must be satisfied to employ non-parametric methods are
less demanding than those that apply to parametric methods. This might
lead to the conclusion that non-parametric methods are less rigorous
than parametric methods. Despite this possible perception, non-parametric
methods allow us to develop very useful order relations
concerning the relative quality of components. The validity criteria
which use non-parametric methods are shown in Table 3. The advantages of
non-parametric methods over parametric methods, which are important for
metrics validation, are the following:

o  Given  the  noisiness  of  metrics  data,  the  fact  that  the  assumptions 
are  less  restrictive  is  a  big  advantage. 

o  No  assumption  is  necessary  about  distribution  (e.g.,  data  does  not 
have to be normally distributed).

o  We  can  use  the  nominal  scale  (i.e.,  component  A  is  high  quality, 
component  B  is  low  quality)  and  location  statistics  like  the  median 
[11].  The  Discriminative  Power  validity  criterion  is  based  on  this 
measurement  property.  Similarly,  we  can  use  the  nominal  scale  to 
indicate  whether  an  incremental  change  in  a  metric  tracks  (yes,  no) 
an  incremental  change  in  a  factor.  The  Tracking  validity  criterion  is 
based  on  this  measurement  property. 

o  We  can  use  the  ordinal  scale  (i.e.,  component  A  is  higher  quality 
than  component  B)  and  order  statistics,  like  ranks.  The  Consistency 
validity  criterion  is  based  on  this  measurement  property.  For 
example,  ranks  of  random  variables  [3]  can  be  used  rather  than  the 
values  themselves,  thus  relaxing  the  assumptions  about  data 
relationships  (e.g.,  linearity)  while  providing  a  measure  of  quality 
(e.g.,  ranking  of  components)  that  is  useful  to  the  software  manager. 
In  other  words  the  fact  that  the  data  is  not  as  "well-behaved"  as  we 
might  believe  it  should  be  does  not  necessarily  mean  that  it  is  less 
useful.  In  fact,  when  we  consider  that  many  useful  applications  of 
metrics  can  be  derived  from  the  ability  to  classify  components  as 
being  "higher  quality"  or  "lower  quality",  we  realize  that  the 
information  provided  by  non-parametric  analysis  is  supportive  of  this 
approach.

Despite  the  advantages  of  non-parametric  methods,  certain  validity 
criteria  lend  themselves  to  the  use  of  parametric  methods.  These  are 
shown  in  Table  3.  Association,  which  measures  the  difference  in 
component  quality,  uses  the  interval  scale.  Predictability  uses  the 
interval  scale  to  predict  a  factor  value  and  the  ratio  scale  for 
measuring  prediction  accuracy.  Lastly,  Repeatability  uses  the  ratio 
scale  for  measuring  metric  validation  success. 

Appendix  A  summarizes  quality  function,  validity  criterion,  purpose 
of  valid  metric,  and  statistical  method. 

Table 3

Validity Criteria Properties

Criterion               Scale             Method           Measurement Property
Association             Interval          Parametric       Difference
Consistency             Ordinal           Non-parametric   Higher/Lower
Discriminative Power    Nominal           Non-parametric   High/Low
Tracking                Nominal           Non-parametric   Increment
Predictability          Interval, Ratio   Parametric       % Accuracy
Repeatability           Ratio             Parametric       % Success

EXAMPLE  OF  VALIDATING  METRICS 


The following example is provided to illustrate the validation of M
with F and the identification of an $M_c$ which would be used in the
quality control function. Also we show how to conduct a cost sensitivity
analysis on $M_c$ in order to identify its optimal value (i.e., the minimum
cost $M_c$ across a range of assumptions about the cost of using $M_c$).

The  data  used  in  the  example  validation  tests  were  collected  from 
actual  software  projects.  The  Discriminative  Power  validity  test  is 
illustrated. 

Purpose  of  Metrics  Validation 

The  purpose  of  this  validation  is  to  determine  whether  cyclomatic 
number  (complexity  (C))  and  size  (number  of  source  statements  (S)) 
metrics,  either  singly  or  in  combination,  could  be  used  to  control  the 
factor  reliability,  as  represented  by  the  factor  error  count  (E).  A 
summary of the data is shown in Table 4 and the detailed data listing
can be found in Appendix B.

Table 4

Project  Application              Procedures       Statements  Errors
                                  (with errors)
1        String Processing         11 ( 5)            136        10
2        Directed Graph Analysis   31 (12)            430        27
3        Directed Graph Analysis    1 ( 1)             13         1
4        Data Base Management      69 (13)           1021        26
Total                             112 (31)           1600        64

Number of procedures: 112 total, 31 with errors, 81 with no errors.
Number of source statements: 2007 total, 1600 included in metrics analysis.
Language: Pascal on all projects.
Programmer: Single programmer. Same programmer on all projects.

Using  the  conventions  of  Figure  1,  the  following  is  the  notation 
applicable  to  this  example: 

Metric: C, S, collected at point 1, Figure 1.

Factor: E, collected at point 2, Figure 1.

Critical Value of Metric: $C_c$, $S_c$, validated at point 3, Figure 1.

V[Projects 1,2,3,4; Design; C, S]

V[Projects 1,2,3,4; Test; E]

Discriminative  Power  Validity  Test 

We divide the data into four categories, as shown in Table 5,
according to a critical value of C, $C_c$, so that a chi-square test can be
performed to determine whether $C_c$ can discriminate between procedures
with errors and those with no errors [4].

Table 5

Contingency Table

             Complexity ≤ 3   Complexity > 3   Total
No Errors          75                6            81
Errors             10               21            31
Total              85               27           112


From the high value of chi-square (41.60) (see Table 6) and the very
small significance level (1.26E-10) in the samples, we infer that $C_c = 3$
could discriminate between procedures with errors (low quality software)
and those without errors (high quality software).

Table 5 shows how well $C_c = 3$ discriminates between
procedures with errors and procedures with no errors: 75 of 81 with no
errors and 21 of 31 with errors are correctly classified.
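
The test on the contingency table can be sketched as below, using the four cell counts from Table 5. This is a minimal sketch: the function names are hypothetical, and the source does not state which variant of the chi-square test was applied (for example, whether a continuity correction was used), so the computed statistic may differ from the 41.60 reported in Table 6.

```python
import math


def chi_square_2x2(a, b, c, d, yates=False):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]] (1 degree of freedom)."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Continuity correction: an optional variant, not stated in the text.
        diff = max(0.0, diff - n / 2)
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * diff ** 2 / denominator


def p_value_1df(chi_square):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2))


# Counts from Table 5: rows are no errors / errors, columns are C <= 3 / C > 3.
no_err_low, no_err_high = 75, 6
err_low, err_high = 10, 21

statistic = chi_square_2x2(no_err_low, no_err_high, err_low, err_high)
significance = p_value_1df(statistic)
correct = no_err_low + err_high  # procedures classified correctly
```

Repeating the calculation for each candidate critical value and keeping the one with the largest statistic mirrors the way the text selects $C_c = 3$.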


Table 6

Projects 1, 2, 3 and 4
112 Procedures (81 with no errors, 31 with errors)

Chi-square    Significance Level
22.32         2.30E-6
32.14         1.44E-8
41.60         1.26E-10
26.80         2.26E-7


Sensitivity Analysis of Critical Value of Complexity

In order to see how good a discriminator $C_c$ is for this example, we
observe the number of misclassifications that result for various values
of $C_c$: 1) Type 1 ("error procedures" classified as "no error
procedures") and 2) Type 2 ("no error procedures" classified as "error
procedures"). This is shown in Figure 8. As $C_c$ increases, Type 1
misclassifications increase because an increasing number of high
complexity procedures, many of which have errors, are classified as
having "no errors". Conversely, as $C_c$ decreases, Type 2
misclassifications increase because an increasing number of low
complexity procedures, many of which have no errors, are classified as
having "errors". The total of the two curves represents the
"misclassification function". It has a minimum at $C_c = 3$, which is the
value given by the chi-square test (see Table 6). The chi-square test
will not always produce the optimal $C_c$, but the value should be close to
optimal.

The foregoing analysis assumes that the costs of Type 1 and Type 2
misclassifications are equal. This is usually not the case since the
consequences of not finding an error (i.e., concluding that there is no
error when, in fact, there is an error) would be higher than the other
case (i.e., concluding that there is an error when, in fact, there is no
error). In order to account for this situation, the number of Type 1
misclassifications, for given values of $C_c$, is multiplied by C1/C2
(C1/C2 = 1, 2, 3, 4, 5), which is the ratio of the cost of Type 1
misclassification to the cost of Type 2 misclassification. These values
are added to the number of Type 2 misclassifications to produce the
family of five "cost" curves shown in Figure 9. Naturally, with the
higher cost of Type 1 misclassifications taking effect, the optimal $C_c$
(i.e., minimum cost) decreases. However, even at C1/C2 = 5, $C_c = 3$ is a
reasonable choice.
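
The sensitivity analysis described above can be sketched as follows: count Type 1 and Type 2 misclassifications for each candidate critical value, weight the Type 1 count by the cost ratio C1/C2, and pick the candidate with the lowest total. The function names and the sample of (complexity, has errors) pairs are hypothetical; the sample is not the Appendix B data.

```python
def misclassifications(procedures, critical_value):
    """Count Type 1 and Type 2 misclassifications for a given critical value.

    procedures: list of (complexity, has_errors) pairs.
    A procedure is predicted to have errors when complexity > critical_value.
    """
    type1 = sum(1 for c, has_err in procedures
                if has_err and c <= critical_value)
    type2 = sum(1 for c, has_err in procedures
                if not has_err and c > critical_value)
    return type1, type2


def weighted_cost(procedures, critical_value, cost_ratio):
    """Type 1 count weighted by C1/C2, plus the Type 2 count."""
    type1, type2 = misclassifications(procedures, critical_value)
    return cost_ratio * type1 + type2


def best_critical_value(procedures, candidates, cost_ratio):
    """Candidate with the lowest weighted cost (smallest candidate wins ties)."""
    return min(candidates,
               key=lambda cc: (weighted_cost(procedures, cc, cost_ratio), cc))


# Hypothetical sample, NOT the Appendix B data: (complexity, has_errors).
sample = [(1, False), (1, False), (2, False), (2, False), (3, False),
          (3, False), (4, False), (3, True), (4, True), (4, True),
          (5, True), (6, True)]
candidates = range(1, 7)
optimum_equal_cost = best_critical_value(sample, candidates, cost_ratio=1)
optimum_high_type1 = best_critical_value(sample, candidates, cost_ratio=5)
```

On this made-up sample the optimum moves to a lower critical value when Type 1 errors are weighted more heavily, which is the qualitative behavior the text describes for the cost curves.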

A Contingency Table was also developed for S, leading to $S_c = 13$. The
same type of sensitivity analysis was performed on $S_c$. It was found that
the optimal $S_c = 15$, as opposed to $S_c = 13$, as given by the chi-square
analysis.

We conclude that C and S are valid with respect to the Discriminative
Power criterion and either could be used to distinguish between
acceptable (C ≤ 3, S ≤ 13) and unacceptable quality (C > 3, S > 13) for
this and similar applications when this data can be collected. However,
only one is needed (i.e., C is highly correlated with S). It should be
noted that it is less expensive to collect S than C.

SUMMARY  AND  FUTURE  RESEARCH 

We described and illustrated a comprehensive metrics validation
methodology that has six validity criteria, which support the quality
functions of assessment, control and prediction. Six criteria were
defined and illustrated: association, consistency, discriminative power,
tracking, predictability and repeatability. These criteria are important
because they provide a rationale for validating metrics; in practice,
this rationale is frequently lacking in the selection and application of
metrics. With validated metrics we have a basis for making decisions and
taking actions to improve the quality of software. We showed that
quality factors, metrics and functions can be integrated with our
metrics validation process. We developed a framework which pulls
together the concepts and definitions of quality factor, quality metric,
validated metric, quality function, validity criteria, and the metrics
validation process. We showed that non-parametric statistical methods
play an important role in evaluating whether metrics satisfy the
validity criteria. An example of the application of the methodology was
presented for the discriminative power validation criterion. The
discriminative power criterion allows the metrics user to control the
production of highly reliable software by providing thresholds of
acceptable quality.

Future research is needed to extend and improve the methodology by
finding an answer to the following question:

o To what extent are metrics that have been validated on one project,
using our criteria, valid measures of quality on future projects -- both
similar and different projects?

Abstract

This  paper  presents  an  organizational  structure,  the  Consolidated  Experience  Factory  (CEF),  for 
instrumenting  the  system  development  process.  Goals  of  interest  to  systems  engineers  and  that  can  be 
satisfied  by  process  and  products  metrics  are  briefly  summarized.  The  goal-questions-metrics 
methodology has been developed in the context of software metrics for deriving individual metrics from
high-level  goals.  Experience  factories  are  organizations  parallel  to  system  development  organizations 
that  serve  to  define  metrics,  collect  and  validate  data,  analyze  the  data,  and  package  the  results  in  a 
usable form. The CEF is an organization for integrating the results of many experience factories.

1. Introduction


This paper presents an approach for instrumentation, data collection, analysis, and improvement
of the systems engineering process. This approach, known as the Consolidated Experience Factory
(CEF), has been developed by Victor Basili of the University of Maryland in conjunction with the Data &
Analysis Center for Software (DACS). The CEF was defined for software development and
maintenance, but, as this paper shows, the approach is general enough to apply to systems
development as a whole. Given the recognized importance of software in defense systems acquisition,
the CEF attacks a crucial component of the problem addressed by this conference.

This paper is organized into six sections. Section 2 summarizes the systems engineering needs
addressed by experience factories. Section 3 presents a method used in software, the
Goals/Questions/Metrics paradigm, for deriving metrics from high-level goals. Section 4 presents the
concept of an experience factory, a logical or physical organization for measuring systems development,
while Section 5 presents the Consolidated Experience Factory, an organization for integrating several
experience factories. Finally, the concluding section discusses future work needed to implement this
approach for instrumenting systems engineering.

2.  Goal-Driven  System  Metrics 

The first premise of the CEF approach is that measurements should be introduced into the
development process to address specific goals. This premise may appear obvious, but some
measurement projects for software have defined a comprehensive set of metrics such that any
high-level goals were obscured. The Software Technology for Adaptable Reliable Systems (STARS) Data
Collection Forms [IITRI 85] are a typical example. Furthermore, since a complete feasible set of metrics
could not be known at the initiation of the field of software metrics, a failure to recognize this premise
was probably a necessary stage in the field's development. The introduction of system metrics should
take advantage of the expertise built up in software metrics over the last two decades and begin with a
top-down approach.

The goals driving the system metrics collected on a particular project should be derived from the
requirements of that project, the development organization, and the Navy sponsoring agencies. The
particular metrics discussed in this paper are not those needed to test system requirements, such as
throughput or functionality, but mainly process metrics to assist developers in project management. For
example, a database can be developed characterizing the system development process in terms of the
distribution of failures throughout the life cycle. A project manager can then compare how his data fits a
typical project "fingerprint" at any point of time. Significant deviations will then suggest issues the
manager will want to address. A metric-based process can be used to ensure systems are actually ready
for scheduled reviews. The Army's Software Test and Evaluation Panel metrics are being used in this
way [DOA-PAM-73XX]. A database of metric data on past projects conducted by the developing
organization is required to fully implement a metric-based approach to systems development.

A system development organization will have goals that can be supported by metrics, in addition
to the requirements of individual projects. The organization will need to develop the database of metric
data that supports individual project managers. The organization will want to characterize their
development process so as to support process improvements in a controlled manner. They will want to
measure the effects of proposed techniques by controlled experiments. Once controlled experiments
lead to a determination that a new technology should be adopted, measurement needs to be conducted
to ensure that it is properly transferred to individual projects and that the expected benefits are obtained.
The development organization needs to understand relationships between the system development
process and the resulting products. All of these goals are most fully supported by more than raw metric
data. The methods used in packaging metric-based models of the development process need to be
carefully considered.

The Navy may want to consider the research needs of system engineering as a whole in their
systems engineering program. If so, they should consider how to integrate the different research activities
conducted on system engineering and avoid bottom-up isolated research activities. Experimentation and
measurement should be used to evaluate and analyze systems research. The results of research
should be refined and tailored for application environments and packaged such that they can be easily
transferred to practice. The relationships between models of the development process and the products
should be made clear. These high-level goals, and others, are addressed by the organization described
in this paper.

3. The Goals-Question-Metrics Paradigm

Various top-down approaches have been defined for deriving metrics from goals in software
engineering. For example, Rome Laboratory has sponsored research that has developed a hierarchical
framework for measuring specific quality factors [Bowen 85]. More suitable for a systems approach
because of its generality is the Goals-Questions-Metrics (GQM) paradigm [Basili 90]. In fact, it was
developed specifically to fit into the experience factory approach described in this paper.

The  GQM  paradigm  is  a  mechanism  for  defining  and  interpreting  measurable  software  (or  system) 
goals. Goals are typically stated in the following format:

Purpose:

Analyze some object (e.g. process, product, experience model) for the purpose of why (e.g.
characterization, evaluation, prediction, motivation, improvement)

Perspective:

with respect to focus (e.g. cost, correctness, defect removal, reliability, user friendliness) from the point
of view of who (e.g. user, customer, manager, developer, corporation)

Environment:

in the following context (e.g. problem factors, people factors, resource factors, process factors).

The  GQM  paradigm  then  provides  a  structured  process  for  generating  measurement  related  questions 
from  these  goals.  Each  question,  in  turn,  generates  a  set  of  metrics. 
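
The goal template and the goal-to-question-to-metric derivation can be sketched as a small data structure. This is only an illustration of the structure described above, not part of the GQM paradigm's own definition: the class name, field names, and the example goal, questions, and metrics are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Goal:
    """One GQM goal, following the Purpose/Perspective/Environment template."""
    obj: str        # what is analyzed (process, product, experience model)
    why: str        # purpose (characterization, evaluation, prediction, ...)
    focus: str      # e.g. cost, correctness, reliability
    viewpoint: str  # who (user, customer, manager, developer, ...)
    context: str    # environment factors
    questions: dict = field(default_factory=dict)  # question -> list of metrics

    def statement(self):
        """Render the goal in the template's sentence form."""
        return (f"Analyze {self.obj} for the purpose of {self.why} "
                f"with respect to {self.focus} from the point of view of "
                f"{self.viewpoint} in the following context: {self.context}.")

    def add_question(self, question, metrics):
        self.questions[question] = list(metrics)

    def metrics(self):
        """All metrics generated by the goal's questions, without duplicates."""
        seen = []
        for metric_list in self.questions.values():
            for metric in metric_list:
                if metric not in seen:
                    seen.append(metric)
        return seen


# Hypothetical example goal with two questions.
goal = Goal(obj="the system test process", why="evaluation",
            focus="defect removal", viewpoint="the project manager",
            context="a single development organization")
goal.add_question("How many failures are found in each life cycle phase?",
                  ["failures per phase"])
goal.add_question("How does this project compare with past projects?",
                  ["failures per phase", "deviation from typical distribution"])
```

Keeping the questions between the goal and its metrics makes it possible to trace every collected metric back to the goal that motivated it.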

The GQM paradigm is not yet a cookbook; its application requires a knowledgeable person. As with
all top-down approaches, its employment requires insight into reasonable lower level results.
Furthermore, the templates developed so far are for software, not systems. Nevertheless, this paradigm
is a very promising approach for system metrics.


4. The Experience Factory - An Organization for System
Measurement

Experience factories provide an organization to apply measurement to system development to
best meet the needs of individual projects and systems engineering organizations [Basili 89]. An
experience factory is a logical organization supporting systems development by analyzing and
synthesizing measured experiences, creating a repository of useful information, and supplying packaged
results to various projects as they need them.

Input to the experience factory includes goals and data. The factory analyzes that data to
characterize the environment and systems engineering methods, evaluate, predict, and improve. The
outputs are packaged models.

Packaged results are central to an experience factory. It does not merely act as a repository of
metric data. Packaging can best be displayed by means of examples. First, simple models might include
simple formulas for prediction. For example, in software an equation might be given predicting total effort
as a function of source lines of code, faults per thousand lines of code as a function of what methods
are used, or total schedule length as a function of source lines. Packaged models of this sort are
obviously useful for forecasting and planning purposes. More complicated packages might include
distributions of key variables, such as types of failures, at particular points in the life cycle. This second
type of packaged result is useful for controlling a project during its development, as well as refining
predictions. A third type of packaged result would be even more detailed. Models might be created,
using a formal notation for the development process, showing the impacts of various combinations of
methods on the distributions of key variables. With this sort of packaged result, the systems engineer
can design his life cycle to meet his particular needs. The organization can also use these results in a
scientific manner to assess the impacts of introducing proposed systems engineering techniques.

5.  The  Consolidated  Experience  Factory  to  Integrate  Experience 
Factories 

Many  systems  engineering  groups  might  set  up  parallel  experience  factories.  These  factories  will 
then be churning away producing packaged models that serve the needs of their groups more or less
successfully.  The  discipline  of  systems  engineering  as  a  whole  will  be  best  served  by  integrating  the 
results  of  the  various  experience  factories.  The  Consolidated  Experience  Factory  (CEF)  will  serve  this 
purpose. 

The CEF is an organization separate from any developer or experience factory. It receives input
from  the  various  experience  factories  and  produces  results  of  use  to  them.  The  CEF  would  answer 
questions  like  the  following: 

•  What  questions  and  metrics  have  organizations  found  useful  for  addressing  specific  goals? 

•  If a new method is introduced into a systems development organization, how might that affect
packaged prediction models, based on the packaged results of other experience factories?

•  What  is  the  domain  of  applicability  of  various  models?  For  example,  what  characteristics  of  a 
development  organization  determine  which  of  many  reliability  models  work  the  most  successfully? 

•  Can "metamodels" be produced that combine models across experience factories? Will these
metamodels be more useful than models tailored for an individual systems organization, especially
as those organizations change?

6.  Facing  Many  Questions 

This paper has proposed a set of organizations (experience factories and the Consolidated
Experience Factory) for instrumenting systems engineering to support the discipline's improvement in a
measured, structured, and scientific manner. How can this vision be brought to pass?

First, this approach has only been defined for software in the past. The concept is general
enough to apply to systems, but the details need to be redefined for systems. For example, templates
have been defined in the past for applying the Goals/Questions/Metrics paradigm to software. New
templates need to be created for systems.

Second,  this  vision  can  be  fulfilled  incrementally.  The  CEF  can  initially  have  a  role  of  helping  a 
small  number  of  organizations  create  their  own  local  experience  factories.  Perhaps,  in  keeping  with  this 
incremental  strategy,  these  initial  efforts  need  only  focus  on  one  aspect  of  the  systems  problem, 
software  being  the  natural  candidate. 

Finally, details of sharing data must be defined. Questions of data confidentiality are not so
important for a local experience factory and its development organization. But they are crucial for the
interface between the Consolidated Experience Factory and the local experience factories. The concept
of "packaged" results presents a new approach to successfully addressing this problem.

This paper presents a framework for classifying projects engaged in
the engineering or reengineering of complex Navy systems. The object of
this framework is to establish comparability among dissimilar projects and
to aid in the transition to newer, more effective paradigms. The paper
examines the problems faced by the Navy in the evolution of existing
systems and analyzes the software process in the context of this challenge.
A four-dimensional framework is defined for classifying projects and their
data.


Introduction 

The development, maintenance, and evolution of mission-critical, real-time systems
is a very complex task. These systems are comprised of many communicating independent
subsystems that must work together in a variety of stressful environments, some of which
will  be  untested  prior  to  the  initial  encounter.  The  systems  are  governed  by  the  laws  of 
physics,  which  may  impose  severe  time  constraints  on  both  decisions  and  actions,  and  they 
must  support  the  decision  making  of  those  who  use  and  rely  on  them.  These  systems  are 
composed  of  hardware,  software,  and  humans,  and  each  subsystem  receives  inputs  from  and 
directs  outputs  to  other  hardware,  software,  and  human  interfaces. 

As  demonstrated  in  Desert  Storm,  the  Navy  has  developed  excellent  systems  that 
perform  very  well.  Many  of  these  systems  require  more  than  a  decade  for  development,  and 
budgetary constraints are certain to limit the number of completely new systems that will
be  built  in  the  near  future.  Thus,  the  challenge  is  to  support  the  refinement  of  the  existing 
systems  by  accepting  modified  or  enlarged  missions  and  exploiting  emerging  technologies 
while,  at  the  same  time,  taking  advantage  of  the  Navy’s  significant  investment  in  existing 
systems. 

No  formal  models  exist  for  a  complete  system.  There  are  scientists  and  engineers 
who  are  specialists  in  selected  areas  such  as  sonar,  radar,  command  and  control,  and 
weapons  control  systems.  Yet,  as  the  systems  become  further  integrated  and  complex,  we 
find problems that can be resolved only in a multidisciplinary setting. The consequences of
interaction  are  difficult  to  anticipate,  and  there  are  few  formal  mechanisms  for  modeling  the 
nonfunctional  requirements  associated  with  timing  and  resource  utilization.  In  summary, 
we  are  confronted  by  “wicked  problems”  that  can  be  neither  resolved  within  a  single 
discipline  nor  comprehended  by  a  single  individual.  Our  successes  to  date  illustrate  that 
viable  solutions  are  within  the  state  of  the  art.  Our  goal,  therefore,  is  to  continue  this 
progress  in  a  mode  that  emphasizes  upgrade  rather  than  retirement,  reuse  rather  than 
replacement. 

Although  the  task  at  hand  may  seem  very  restrictive,  there  are  at  least  two  reasons 
for  optimism.  First,  one  learns  by  experience,  and  the  existence  of  complex,  effective 
tactical  systems  is  evidence  that  significant  stores  of  knowledge  exist.  One  problem  with  this 
knowledge  base  is  that  it  is  poorly  organized.  In  software,  for  example,  the  knowledge  of 
what  the  system  does  is  embedded  in  the  code,  which  describes — not  what  is  done — but  how 
it  is  done.  One  consequence  is  that  only  small  software  modifications  are  attempted;  lacking 
system-level  understanding,  program  managers  avoid  risk  by  localizing  change.  Such  an 
approach  is  quite  reasonable  in  an  environment  in  which  the  next  generation  system  will 
replace  the  present  generation,  but  it  is  incompatible  with  a  philosophy  of  evolving  from 
one  generation  system  to  the  next. 

Fortunately,  the  second  reason  for  optimism  addresses  this  problem.  A  revolution 
in  the  way  we  perceive  software  is  underway,  and  this  makes  the  more  rational  capture  and 
reuse  of  system  knowledge  possible.  This  transition  to  a  new  view  of  software,  however,  is 
still  in  its  early  stages,  and  most  research  in  software  engineering  builds  on  the  existing 
paradigm.  Thus,  when  viewed  in  the  context  of  how  projects  now  are  run,  the  new  approach 
may  seem  speculative  and  high  risk.  Yet,  because  it  offers  such  a  potential  for  process 
improvement  (as  measured  in  productivity,  adaptability,  and  quality),  it  is  an  important  area 
of  research.  Further,  because  many  researchers  have  been  working  with  this  paradigm  for 
more  than  a  decade,  there  is  also  a  potential  for  near-term  application. 

Since 1980 I have been working with a modified paradigm for systems development.
Most  of  my  experience  has  been  with  interactive  information  systems,  and  a  decade  of 
activity  has  been  carefully  evaluated  and  reported  [Enlb89,  Blum90].  During  the  past  few 
years  I  have  been  working  in  the  domain  of  complex  systems,  albeit  at  the  conceptual  level. 
One product of this research is a new understanding of how we develop systems and the role
that  software  plays  in  a  system’s  development  and  evolution.  This  understanding  now  is 
sufficiently  mature  to  permit  the  formulation  of  a  framework  for  classifying  activities  that 
both  aids  in  project  comprehension  and  leads  to  the  adoption  of  more  effective  methods. 

This  paper  describes  a  general  framework  for  engineering  and  reengineering  that 
should  aid  in  the  evolution  from  one  generation  of  system  to  the  next.  Because  new  systems 
will evolve from one generation to the next, the distinction between engineering and
reengineering  is  fuzzy.  For  example,  the  introduction  of  a  newly  engineered  component 
may  have  no  immediate  effect  on  the  system’s  functionality,  and  the  need  to  comply  with 
existing  interfaces  may  constrain  a  new  development.  Thus,  rather  than  distinguish  between 
these  complementary  activities,  I  shall  treat  them  as  one.  Therefore,  the  objective  of  an 
engineering/reengineering framework is to guide the development and evolution of a system
throughout its productive lifetime. It must consider the immediate concerns of the
managerial and technical staffs responsible for the engineering/reengineering activities, and
it also must examine how their efforts can build a foundation for improved technology.
Restating  this,  the  framework  must  provide  guidance  for  today’s  projects  and  build  a  bridge 
to  the  technology  of  the  next  decade. 


Observations  on  Software  Engineering 

My area of interest is software engineering, which, in a narrow sense, involves the
managing, conducting, and evaluating the development and maintenance of software
components.  It  is  a  branch  of  engineering  in  that  it  focuses  on  the  creation  of  useful 
artifacts  through  the  application  of  scientific  principles.  It  differs  from  most  other 
engineering  disciplines  in  that  software  engineering  is  not  bound  by  the  physical  laws  of 
nature;  rather  it  is  guided  by  the  models  that  formalize  our  current  understanding.  That  is, 
unlike  electrical  engineering,  which  must  respond  to  repeatable,  external  phenomena,  the 
solution  space  of  the  software  engineer  is  dominated  by  the  formal  models  created  by 
computer  science  (e.g.,  programming  languages,  tools  for  representing  abstractions).  The 
software  engineer’s  models  are  artifacts:  products  of  human  creativity.  Unlike  the  laws  that 
explain the behavior of electrons, these models establish an approach to software
development that becomes a self-fulfilling prophecy. The software engineer’s interpretation
of the problem determines his response to it, and one theme of my research is that the
software engineer begins with an imperfect problem statement.

Although  software  engineers  focus  on  a  software  product,  it  must  be  recognized  that 
virtually  every  software  product  must  operate  on  a  computer  (i.e.,  with  hardware)  and 
interact  with  users  (i.e.,  humans).  Thus,  software  is  always  a  part  of  some  larger  system 
(which,  in  turn,  may  be  part  of  an  even  larger  system).  Moreover,  the  goal  of  that  system 
is  to  meet  some  need  in  the  application  domain.  The  interpretation  of  a  software  product 
outside the scope of the system and the need it addresses represents a level of abstraction
fraught  with  danger.  Thus,  the  primary  challenge  that  the  software  engineer  faces  is  not 
how  to  write  or  modify  some  piece  of  code;  rather,  it  is  to  understand  how  that  code  meets 
some  need.  The  software  engineer  must  recognize  that  he  is  performing  systems 
engineering  and  that  the  code  is  simply  an  expression  of  the  most  detailed  design  of  a 
response  to  a  need.  Unfortunately,  this  mode  of  thinking  is  not  very  common,  which  can 
result in long-term consequences over the life of a system.

The  seeds  of  today’s  software  orientation  were  sowed  in  the  early  days  of 
computing.  The  first  need  was  to  produce  programs;  symbolic  assemblers  and  high-order 
languages  made  that  task  easier.  Once  we  mastered  the  writing  of  programs,  we  confronted 
the  difficulty  in  creating  systems.  The  discipline  of  software  engineering  was  spawned  by 
the NATO-sponsored conferences [NaRa69]. These meetings focused on the development
of large-scale, system-oriented software. The waterfall flow, first introduced by Royce
[Royc70] in 1970 and later refined by Boehm [Boeh76], introduced a phased development.
One  could  not  go  on  to  the  next  phase  until  the  previous  phase  was  complete  and  validated. 
The  output  from  each  phase  was  used  to  define  the  scope  of  its  successor  phase,  and  the 
model  provided  for  corrective  feedback  to  earlier  phases.  This  model  for  software 
development was copied from experience with hardware; indeed, Boehm’s waterfall diagram
differed from the hardware flow only in the use of "Software" in the title "Software
Requirements" and the relabeling of "Fabrication" as "Code and Debug."

This parallel between software and hardware continues today. Here is how a
recent book puts it [CaCiiMU]:

Software design is the product engineering part of software development.

Programming is in some ways analogous to the manufacturing part.

Although the waterfall diagram has been much maligned, most of the proposed alternatives
can be seen as adaptations of that basic model. Prototyping was introduced as a means to
validate the specification before the phased development begins [GoSeXJ]. The spiral
model emphasizes the risk-reduction activities in the early phases of development [Boeh88].
It too creates a valid specification, and implementation follows a traditional phased
approach. Incremental development divides the process into layered builds, with each build
following a phased plan [Gilb88]. In each of these models, the software process begins with
a definition of what the system should do. This is followed by a design of how the system
should be implemented, and, once the design is detailed enough to permit coding, the
programs are implemented. Programs are tested, and tested components are then
integrated and tested again. The perception is that software, like hardware, is a product to
be implemented. Analysis and design determine what the product should do and how it
should be constructed; integration and test establish that it performs as desired.

I believe that this hardware-based, product-oriented view of the software process
has led us to focus on the software implementation rather than the knowledge that
motivated its creation [Blum92c]. This idea can best be introduced by way of the software
process metamodel in Figure 1. It presents the essence of the software process as a
transformation from some need in an application domain into a software implementation
that responds to that need. Two nonintersecting modeling lines are shown. The conceptual
models reflect the application domain perspective; they describe the proposed response to
the need. Although the conceptual models use domain formalisms and express the domain
specialists' intent, they are not formal in the computer science sense. They are termed
conceptual because they describe but do not prescribe the software solution. The conceptual
models must be transformed into formal models that establish the essential behaviors and
performance of the desired software product. Finally, details are added until an
implementation exists that is correct with respect to the formal model. The implementation,
of course, is also a formal model.

1 Although there are some software process models, such as the operational approach [Zave84] and model
execution [Hare92], that do not echo the hardware development model, space does not allow us to consider them
here. For a more complete discussion of this topic see [Blum92a, Blum92b].

Figure 1 The essential software process.

Software, however, is not static. Lehman defines E-type programs as programs that
alter the requirements to which they respond, thereby initiating a demand for change
[Lehm80]. Thus, the metamodel of Figure 1 is but one iteration within a continuing cycle
of change and improvement. The transformation represented by this model, from a need
to a software product intended to meet that need, can be decomposed into a sequence of
three transformations.

From the need identified in the application domain to the conceptual model that
establishes how the technology can provide an appropriate solution. Here analysts
require a deep understanding of the application domain plus knowledge of the
potential solutions supported by the technology.

From the conceptual model, which describes in terms natural to the domain
specialist what is to be implemented, to a formal model, which establishes the
behavior and performance of the product to be delivered.

From the formal model to the implementation. This is the historic domain of
software engineering. The final step in this process (e.g., compilation) is always
automated.

Notice that many distinct classes of conceptual model will be valid responses to a given
need, many distinct classes of formal model will be valid translations of a given conceptual
model, and many distinct classes of implementation will be correct for a given formal model.
Thus, the software process (and, indeed, every design process) involves successive
restrictions of the solution space until only one solution exists. Knowledge related to
rejected alternatives seldom is retained. This implies that, as the product evolves, we enrich
our understanding of the particular solution that the product represents, but we lose
knowledge associated only with alternative solutions.
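
The narrowing described above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the `DesignRecord` class, the `choose` method, and the placeholder alternatives ("concept A", "spec A2", and so on) are hypothetical and do not come from the paper. The sketch walks one need through the three transformations in order and, unlike the usual practice noted in the text, keeps the rejected alternatives at each stage.

```python
from dataclasses import dataclass, field

# The three transformations end in these three kinds of model, in order.
STAGES = ("conceptual model", "formal model", "implementation")


@dataclass
class DesignRecord:
    """Hypothetical log of how one need is narrowed to one implementation."""
    need: str
    chosen: dict = field(default_factory=dict)    # stage -> selected alternative
    rejected: dict = field(default_factory=dict)  # stage -> alternatives set aside

    def choose(self, stage, candidates, selected):
        # Stages must be resolved in order: each choice restricts the next.
        if len(self.chosen) == len(STAGES) or stage != STAGES[len(self.chosen)]:
            raise ValueError(f"unexpected stage: {stage!r}")
        if selected not in candidates:
            raise ValueError("the selected alternative must be one of the candidates")
        self.chosen[stage] = selected
        # Retain what is usually lost: the alternatives that were not selected.
        self.rejected[stage] = [c for c in candidates if c != selected]


# Placeholder data, invented for illustration only.
record = DesignRecord(need="example need")
record.choose("conceptual model", ["concept A", "concept B"], "concept A")
record.choose("formal model", ["spec A1", "spec A2", "spec A3"], "spec A2")
record.choose("implementation", ["build X", "build Y"], "build Y")
```

After the three calls, `record.chosen` holds the single surviving solution at each stage, while `record.rejected` holds the knowledge about alternatives that a product-only view would discard.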

It  is  important  to  distinguish  between  a  need  and  a  specific  response  to  that  need. 
As it is currently constituted, the software process narrows the solution space to realize a
particular response to a need. The need, however, can be open (i.e., there are many
potential responses that could satisfy it) or closed (i.e., the solution implies a specific
response) [BlMo92]. Consequently, there will be many conceptual and formal models that
can fulfill the intent of an open application need, and, in contrast, few models will exist that
can satisfy the requirements of a closed objective. But once a solution is accepted and
expressed as a formal model, the problem space is changed. Correctness of the software is
with respect to that particular solution (i.e., not with respect to the initial need). After the
formal model exists, the design is open only to the extent that alternative designs are
available to satisfy the requirements (i.e., the problem space becomes that of software
implementation). When one begins with an understanding of the application need, the
openness of the problem is apparent. On the other hand, if one begins with the existing
implementation, it is very difficult to distinguish between the results of a design decision and
the inherent constraints of the problem. That is, when going from the left to the right in
Figure 1, one sees the software in the context of the problem to be solved. In contrast,
when going from the right to the left, one sees the problem in the context of a particular
response to it.

This difference in perspectives is captured in Figure 2. It shows two software
paradigms. The product-oriented paradigm is the traditional hardware-based view of the
process. One begins with a system specification (i.e., the formal model of Figure 1) and
concludes with the implementation of the system. The problem-oriented paradigm, on the
other hand, operates within the problem space; it begins with the identification of a solution
to the problem and ends with the complete design of that solution. In the context of the
three transformations in Figure 1, the product orientation focuses on the third
transformation and the problem orientation on the first two. (The solution design is the detailed
formal model.) For closed problems, there are fewer solutions, and one can be specified and
implemented. For open problems, however, there is the danger that the product will be
bound to a solution that becomes obsolescent; here it is desirable to retain a full
understanding of the problem to be solved and the alternative solutions under
consideration. Unfortunately, the two paradigms are incompatible; they cannot be merged, and one
cannot evolve from the other. And this places us on the horns of a dilemma. Many of the
Navy tactical systems involve hardware, which must be specified before it is manufactured,
yet, as noted in the introduction, most improvements to Navy tactical systems will come
through incremental enhancements.


[Figure 2 layout: Product Oriented runs from System Specification to System
Implementation; Problem Oriented runs from Solution Identification to Solution Design.]

Figure 2 Two software paradigms.


The consequences of this tension between system goals and product structure are
illustrated in the next three figures. Figure 3 depicts the formal knowledge of a system
developed with a product-oriented paradigm. It depicts a concept of operations, which is
supported by a requirements document that defines a specific system to support that
concept of operations. The remaining four trapezoids in the diagram represent the
successive levels of detail necessary to produce a product that supports the intended
concept of operations in the prescribed manner. (The diagram shows the source code as
the lowest level of design detail, not a product created from the design.) If the concept of
operations is static and if the knowledge is well documented, then the knowledge structure
shown in Figure 3 is very effective. One can trace code to requirements and concepts to
operational entities. But there are two fundamental problems with the knowledge base in
Figure 3. First, the knowledge is poorly organized, often incomplete, and difficult to access
or integrate. It tends to be document based, and there are relatively few formally
maintained links between levels (e.g., changes to the source code may not be reflected in
the PDL and vice versa). Second, the knowledge base is dynamic, and the concept of
operations shifts as system experience grows and as the external environment changes.

Figure 3 Knowledge of a new system.

Figure 4 demonstrates the knowledge shift as a concept of operations adapts to
external requirements. In this example, the concept of operations has identified a
completely new mission, but the product itself has not changed (i.e., the source code is
unaltered). Now there is a mismatch between what the program (source code) does and
what is needed (the concept of operations). The figure shows the requirements for a new
product that will respond to the new need. Although the new requirements differ from the
old requirements, much of the design (and source code) can be reused. Dashed lines depict
the parts of the existing system that are obsolescent with respect to the new concept of
operations. In an era of new system development, the "safe" approach would be to scrap
the old system and custom build a new system. Such an attack is now recognized as being
too expensive and of too high a risk. Consequently, there is a need to reengineer reusable
components in the older system and to guide the development of the new system in the
exploitation of the reusable components [Free87, Trac88, PrAr91, Boeh90]. From a
knowledge-based perspective, the challenge is to look down from the concept of operations
to identify what concepts and operations can be generalized (e.g., the domain analysis
orientation) and look up from the level of the source code to identify what components can
be reengineered for reuse (e.g., the component library orientation). Unfortunately, the
knowledge, arranged so neatly in Figure 3, is not structured to support such a transition.

Figure 4 Knowledge of an existing system and a new concept of operations.

A different change scenario is presented in Figure 5. Here the concept of
operations document has remained relatively fixed, but the programs have been altered.
Thus, although there have been many changes to the source code, the rationale for these
changes may not have been documented as changes to the concept of operations or the
requirements. Because the higher level documentation (i.e., knowledge) is out of date, it
is viewed as untrustworthy, and the incentive for not updating the higher level documents
increases. That is, because the documentation has not been updated, it is not used; because
it is not used, it will not be updated. Confronted with this reality, the maintainer focuses on
the object of change and not the reason for change. (That is, on the product to be altered
and not the problem to be solved.) Consequently, changes are limited to what can be
understood, and knowledge of the system degrades further.

One method for relieving this tension is to institute a knowledge-based approach
in which knowledge of the application domain, the technology used, and specific systems are
maintained in an integrated manner that permits reuse, prototyping, and the partially
automated generation of products. There are many researchers working on responses of
this general category, and the purpose of this paper is to establish a framework for
classifying Navy activities so that the experience may be exploited in the development and
adoption of new paradigms. In the context of what has been presented in this section, my
personal view is that we are in a product-oriented paradigm and should move to a
problem-oriented paradigm. Many (including myself) are working on alternative approaches, and it
would be premature to speculate about the form of systems in that paradigm. Nevertheless,
it should be clear that the new paradigm will be knowledge-based, that it will offer little
immediate help to projects embedded in a product-oriented paradigm, and that it represents
our best hope for the inexpensive, reliable, and flexible evolution of future Navy tactical
systems.


An Engineering/Reengineering Framework

The  objective  of  this  section  is  to  identify  a  framework  for  the  description, 
collection, and assessment of engineering/reengineering projects for complex systems.
Lacking such a framework, commonality among projects will be obscured, and the transfer
of project experience will be degraded. The framework should have two important
properties. 

It  should  aid  the  project  organizers  and  evaluators  in  the  articulation  of  the  project 
goals. 

It  should  aid  in  the  refinement  of  cross-project  knowledge  and  facilitate  the 
introduction of new concepts and methods.

At  this  point  the  definition  of  the  framework  is  speculative,  and  it  has  not  been  tested  with 
the classification of real projects. To facilitate analysis at this early stage, I
restrict the framework definition to just the engineering/reengineering of the software
components in a system. The framework, however, should be extensible to include all
system  components. 

There are four dimensions in the software engineering/reengineering framework.

Problem granularity characterizes the problem to be solved by the project. It may
range from a full system to be engineered to a single component to be
reengineered. Associated with granularity are effort, cost, and schedule constraints
plus estimates of the available experience. The objective is to associate experience
with some granularity measure so that, for example, experience with 10 effort-year
projects can be referenced by other projects of comparable size.

Problem  level  characterizes  the  level  of  problem  addressed.  I  identify  three  levels. 

Knowledge oriented. This is the level just described as problem-oriented.
Its goal is to use knowledge to guide aspects of the process. Examples
include domain analysis, megaprogramming, and the operational approach.

Object oriented. Although this is a product-oriented view, it is higher level
than that of the code. It maximizes the benefit provided by encapsulation
and information hiding. Projects that employ the Ada programming
language at the design level provide experience at this level.

Product oriented. This is the lowest level of engineering/reengineering. Its
goal is to produce components, and the emphasis is on the component, not
the domain activity it supports. Routine program maintenance is an
illustration of a project at this level.

The goal of software engineering should be to raise the level of the problem being
addressed.

Project motivation classifies the rationale behind the project's initiation. I identify
five motivations.

New  product.  This  is  the  creation  of  a  new  product. 

Correction. This is a modification of an existing product to correct a fault
or deficiency. It is similar to the error repair of corrective maintenance.

Adaptation. This is the modification of an existing product that does not
alter the functionality of the product but that alters the product to
accommodate an altered environment (e.g., changes to a fire-control
software module to conform to an altered radar interface).

Enhancement.  This  is  the  modification  of  an  existing  product  to  alter  and 
improve  its  functionality,  performance,  etc.  It  is  similar  to  perfective 
maintenance. 

Experimental. This motivation is reserved for projects that experiment with
a new technology or concept (e.g., the conversion of CMS-2 programs to
Ada).

Some projects may have more than one motivation, but the framework may permit
only  one  motivation  for  a  project. 

Supporting paradigm refers to the problem-oriented and product-oriented
paradigms depicted in Figure 2. For the near term, the definition of the
problem-oriented paradigm is extended to include both projects that employ that paradigm
and projects that have, as a primary goal, the building of a knowledge base for
potential use by a method employing the problem-oriented paradigm.
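
As a rough illustration of the four dimensions, the following sketch encodes them as a small data structure and groups projects that share the same combination of values. It is a minimal sketch under stated assumptions, not part of the proposed framework: the class and function names, the granularity bands in `granularity_band`, and the three sample projects are all hypothetical, and the paper does not define any size thresholds.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class ProblemLevel(Enum):
    KNOWLEDGE_ORIENTED = "knowledge oriented"
    OBJECT_ORIENTED = "object oriented"
    PRODUCT_ORIENTED = "product oriented"


class Motivation(Enum):
    NEW_PRODUCT = "new product"
    CORRECTION = "correction"
    ADAPTATION = "adaptation"
    ENHANCEMENT = "enhancement"
    EXPERIMENTAL = "experimental"


class Paradigm(Enum):
    PROBLEM_ORIENTED = "problem oriented"
    PRODUCT_ORIENTED = "product oriented"


def granularity_band(effort_years: float) -> str:
    """Assumed size bands; the thresholds are invented for illustration."""
    if effort_years < 1:
        return "under 1 effort-year"
    if effort_years < 10:
        return "1 to 10 effort-years"
    return "10 or more effort-years"


@dataclass(frozen=True)
class Project:
    name: str
    effort_years: float        # stands in for problem granularity
    level: ProblemLevel
    motivation: Motivation     # exactly one motivation per project
    paradigm: Paradigm

    def node(self):
        """Position of the project in the four-dimensional project space."""
        return (granularity_band(self.effort_years), self.level,
                self.motivation, self.paradigm)


def group_by_node(projects):
    """Collect the names of projects that occupy the same node."""
    nodes = defaultdict(list)
    for project in projects:
        nodes[project.node()].append(project.name)
    return dict(nodes)


# Hypothetical projects, invented for illustration only.
projects = [
    Project("A", 12.0, ProblemLevel.PRODUCT_ORIENTED,
            Motivation.ADAPTATION, Paradigm.PRODUCT_ORIENTED),
    Project("B", 15.0, ProblemLevel.PRODUCT_ORIENTED,
            Motivation.ADAPTATION, Paradigm.PRODUCT_ORIENTED),
    Project("C", 0.5, ProblemLevel.OBJECT_ORIENTED,
            Motivation.EXPERIMENTAL, Paradigm.PRODUCT_ORIENTED),
]
nodes = group_by_node(projects)
```

Here projects A and B land in the same node, so their measurements could reasonably be compared with each other, while project C sits in a different node and would need its own baseline.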

The benefit of this framework is that all projects can be placed in a project space
that permits comparisons of characteristics across projects. For example, lines of code per
hour measures are like blood pressure readings; without knowing the patient, diagnosis, and
therapy, a reading of 150/85 is meaningless. By tying a project to a node in the framework,
one has a baseline for comparing results from different projects or methods. One also can
use the framework to organize more detailed investigations. For instance, consider the
problem level dimension for reengineering projects.

Reengineering at the knowledge-oriented level treats the knowledge in the system
documents as a unified whole that permits integration of concepts, reuse of evaluation tools
and techniques, and access to both current and historical design information. Therefore,
the framework at this level can consider questions such as:

What  knowledge,  documentation,  simulations,  and  other  Itxils  are  available  for 
systems,  and  how'  accurtiie,  tlexible,  and  transportable  arc  they? 

What knowledge is available in forms that can be processed and indexed within an
integrated database? What is the granularity of this knowledge, and can it be
adapted for processing by off-the-shelf tools?

What  research  in  knowledge  representation,  simulation,  model  execution,  and  so 
on,  would  be  applicable  to  adaptation  in  support  of  reengineering? 

At the knowledge-oriented level, the intent is to understand the domain so that a
transition plan (or bridge to the new technology) can be proposed. For the object-oriented
level, the goal is to evaluate the success in utilizing process-improvement techniques that
emphasize encapsulation, reuse, and components. In the forward engineering view, the
following questions can be answered:

How are new development activities exploiting the features of Ada? Are there
project evaluations that would aid other projects? Are there libraries available for
exchange?

Are there measures for the degree of encapsulation and reuse employed? Are
there revised models for the software process using these development methods?
How reliable are these measures?

For reengineering, additional questions can be addressed:

To what extent has reengineering used an object-oriented level of abstraction?
What are the costs and benefits of this technique?

How is encapsulation employed in Ada-based reengineered software projects?
What is reused and what is packaged for reuse by subsequent packages? Which of
the Ada features are used in reengineered software? How (if at all) is the
documentation altered to accommodate the object orientation?

At the product-oriented level, the primary concern is for the methods used in
transforming a product from one form (e.g., code in CSM-2) to another (e.g., code in Ada).
Of  particular  interest  here  are  questions  such  as: 

What criteria were used to decide to reengineer a component? To control and
evaluate the reengineering activity? To validate the reengineered product?

What  technology  and  tools  were  used  to  reengineer  the  product?  How  was  the 
design  knowledge  captured?  What  level  of  abstraction  was  used  to  guide  the 
forward  engineering  process? 

What  was  the  motivation  for  the  reengineering  (e.g.,  improved  interoperability, 
hardware  change,  altered  functionality)?  Are  there  different  reengineering 
methods  for  different  motivations? 

Thus,  the  availability  of  a  sound  framework  not  only  guides  in  the  analysis  of  project  data, 
but  it  also  assists  in  the  definition  of  new  analytic  efforts. 


Summary 

This paper began with an examination of Navy tactical systems and observed that
most future improvements will result from the evolution of existing systems rather than the
development of new systems. The paper then addressed the engineering/reengineering
issues associated with this need for continuing evolution. The emphasis was placed on the
software process, and the discussion concluded that (a) the present paradigm was limited,
(b) no near-term alternatives are available for complex Navy projects, and (c) there is
enough advanced knowledge to support the building of bridges from the present to the
future paradigms. An organizing framework was introduced that can guide in the analysis
of data and assist in the formulation of new studies.


ABSTRACT


The  representation  of  resources  is  a  major  issue  in  designing  large  and  complex  systems. 
The  ability  to  represent  and  analyze  these  resources  early  in  the  design  process  supports  the 
understanding  of  how  resources  are  utilized,  resulting  in  a  major  cost  reduction  in  system 
integration.  This  paper  presents  a  method  to  generate  the  system's  Implementation  Capture  View. 
This  view  is  defined  as  a  documentation  of  the  resources  and  their  interfaces  which  make  up  the 
system  under  design  including  the  hardware,  software,  and  human  operators.  The  Implementation 
Capture  View  also  includes  documentation  of  the  resource  selection  and  design  rationale,  and  a 
mapping  from  the  Functional  and  Behavioral  Capture  Views  to  the  resources  in  the  Implementation 
Capture  View. 


Introduction 

Research  on  the  implementation  part  of  the  system  has  been  conducted  and  is  being 
continuously  updated.  Mostly  the  implementation  issue  is  addressed  as  part  of  the  design  process, 
and  the  representation  of  the  system's  resources  is  limited  to  a  certain  design  scope.  In  their  book 
[Rum],  Rumbaugh,  et  al.,  present  an  Object-Oriented  based  approach  in  dealing  with  the 
implementation  issue,  but  the  method  only  emphasizes  the  problem  of  small-to-medium  sized  and 
software-oriented  systems.  The  same  experience  was  found  in  the  Structured  Analysis  method 
[You]. 


The design of large, mission critical systems demands an understanding of all system
characteristics. The concept of using the five system capture views (Informational, Functional,
Behavioral, Implementation, and Environmental) to completely represent the system has been
introduced [Hoa]. Though these system capture views were identified, much work will be needed
to specify each of their capturing formats. Also, the relationship between these views must be
identified such that the completeness of the design capture is fulfilled. The focus of this paper is
on the Implementation Capture View and the relationship between this View and the Functional and
Behavioral Capture Views.

The  Implementation  Capture  View  documents  the  architectural  descriptions  and 
performance  capabilities  of  all  hardware,  software,  and  human  resources  which  represent  a 
particular  embodiment  of  the  system  under  design.  The  hardware  architecture  describes  the 
physical  resources  of  the  system  including  the  components,  interconnection  topology  and  protocol. 
The  software  architecture  describes  the  Computer  Software  Configuration  Items  (CSCI)  and  the 
executable  software  tasks  including  the  messages  passed  between  modules.  Finally,  the 
Humware  Architecture  describes  the  number  of  personnel  required  to  operate  the  system  under 
various  conditions  and  the  level  of  training  and  experience  for  each  operator.  The  hardware, 
software  and  humware  architectures  are  captured  in  a  database  which  also  includes  the  resource 
selection, design rationale, and the traceability of system requirements through the design.

The description of the resource architectures represents the principal products of the
systems engineering effort and establishes a baseline for the detailed design of the system and
operating concept. The architecture descriptions provide a basis for the development of system
specifications (e.g., DOD-STD-2167A System Segment Specification (SSS), Software Requirements
Specification (SRS), etc.). These specifications in turn provide the project software and hardware
engineering teams with the basis for detailed system design. They also establish a basis for
development and analysis of a performance simulation for the system under design. Simulation
provides the ability to identify potential shortfalls and errors in the system design early in the
design cycle when changes and corrections are considerably less costly.

Although the capturing technique is described as a series of steps or activities which are
presented in a particular order, this sequence is not intended to be a rigid formula for generating
the Implementation Capture View. The order of the steps represents a general flow of activity
which is intended to be iterative both between steps and across the overall process. The following
sections describe the information which must be captured and a preliminary process for
accomplishing that capture. It is important to note that this methodology only emphasizes the
generation of the Implementation Capture View based on a predefined Functional and Behavioral
Capture View.


Identify  System/Subsystem  Function  of  Interest 

The first step in developing an Implementation Capture View is to identify a complete
logical model of the system at some level of decomposition or design level of detail. Subsystems
or components of a larger system can also be addressed; however, it is important to clearly define
the boundaries of the logical model to be implemented. This becomes particularly important when
the logical and implementation models of the system (or subsystem) under design are simulated.
Simulation of a given design requires explicit external interface definitions and an unambiguous
definition of the sim/stim requirements.

This step is a precursor to developing a function-resource mapping in that it defines a
consistent and complete function listing for the system under design at a given level of detail. The
mapping may be accomplished at any level of functional decomposition; however, the functions
mapped must represent the entire system (or clearly defined sub-system). High or low levels of
abstraction (or some combination) may be used; however, no redundant function capture is allowed
(i.e., a function and its children). An example of identifying a complete logical model at a given level
is illustrated in Figure 1. This example shows the functional decomposition of the example sonar
problem down to three levels from the context diagram.


Figure 1. Example Sonar System Functional Decomposition

The solid lines connecting the functions represent the parent/child relationships in the
decomposition. The dashed line encloses a set of functions which represent the entire system.
Note that functions at two different levels are used to represent the system, but no overlap (i.e.,
inclusion of a function's parent or child) is allowed. Any overlap in this selection of functions to
represent the system would result in duplication and ambiguity in the associated data flow model.
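
The no-overlap and completeness rules can be checked mechanically. The following is a minimal sketch, assuming the decomposition is held as a parent-to-children dictionary; the function names in the tree are hypothetical stand-ins, not the contents of Figure 1.

```python
def leaves_under(tree, node):
    """Return the set of lowest-level functions at or below node."""
    children = tree.get(node, [])
    if not children:
        return {node}
    leaves = set()
    for child in children:
        leaves |= leaves_under(tree, child)
    return leaves


def check_selection(tree, root, selected):
    """Return (overlaps, uncovered) for a candidate set of functions.

    overlaps:  pairs of selected functions where one lies beneath the other.
    uncovered: lowest-level functions not represented by any selection.
    """
    covered = {f: leaves_under(tree, f) for f in selected}
    names = sorted(covered)
    overlaps = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
                if covered[a] & covered[b]]
    uncovered = leaves_under(tree, root) - set().union(*covered.values())
    return overlaps, uncovered


# Hypothetical decomposition, loosely modeled on the sonar example.
tree = {
    "sonar": ["acquire", "process", "display"],
    "process": ["beamform", "detect"],
}
good = {"acquire", "beamform", "detect", "display"}   # two levels, no overlap
bad = {"acquire", "process", "detect"}                # parent and child both chosen

print(check_selection(tree, "sonar", good))
print(check_selection(tree, "sonar", bad))
```

In a tree, two selected functions share a lowest-level function only when one is an ancestor of the other, so the intersection test is enough to detect the overlap described above.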


Resource Library

A resource library is a repository for all candidate resources that can be used or built into
the design process. The resource library exists in the form of a database where all candidate
resources are categorized from a very general class to a specific type. While the descriptions and
performance capabilities of the off-the-shelf products are documented according to their
manufacturing specifications, the "to-be-designed" or modified resources are listed by their
expected values including their design constraints. The multi-level resource classification allows the
early prediction of system performance without constraining the design to a specific type of
resource.

The resource library includes both "black box" resources and complex resources. Black box
resources are described in terms of their physical characteristics, performance, and other design
factors but are not decomposed into component parts. This does not imply that black box
resources are simple, only that their components are not described in the resource library. Complex
resources are also described in terms of selected design factors but, in addition, include a
component level description which identifies the constituent parts of the complex resource and the
internal interconnection of the parts. Cross references are provided for component parts and
related sub-systems which are also found elsewhere in the resource library. A graphical user
interface which provides point and click access to the resource descriptions contained in the library
is envisioned to facilitate ease of use.

The resource library can be used as a reference point for developing the system resource
architectures and must be established before the function-resource mapping process. Figure 2
illustrates an example hierarchy for organizing the resource library used in developing the passive
sonar system example.
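
As a rough illustration of such a library, the sketch below stores each resource with a classification path running from general class to specific type, and treats a resource with no listed components as a black box. The class names, field names, and entries are assumptions made for the example; the paper does not prescribe a schema.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    path: tuple                      # general class -> ... -> specific type
    factors: dict = field(default_factory=dict)      # factor -> {kind: value}
    components: list = field(default_factory=list)   # empty for a black box

    @property
    def is_black_box(self):
        return not self.components


class ResourceLibrary:
    def __init__(self):
        self._items = {}

    def add(self, resource):
        self._items[resource.name] = resource

    def find(self, *prefix):
        """Names of resources classified under the given class prefix."""
        return sorted(name for name, r in self._items.items()
                      if r.path[:len(prefix)] == prefix)

    def missing_components(self, name):
        """Component names that have no entry of their own in the library."""
        return [c for c in self._items[name].components
                if c not in self._items]


# Hypothetical entries; names and values are illustrative only.
lib = ResourceLibrary()
cpu = Resource("cpu_1mip", ("hardware", "processor", "general purpose CPU"),
               factors={"throughput_mips": {"specified": 1.0}})
bf = Resource("beamformer", ("hardware", "processor", "signal processor"),
              factors={"throughput_mips": {"expected": 20.0}},
              components=["cpu_1mip", "fft_card"])
lib.add(cpu)
lib.add(bf)

print(lib.find("hardware", "processor"))
print(lib.missing_components("beamformer"))
```

Querying by a short prefix returns every candidate in a general class, which is what permits early performance estimates before the design commits to a specific type; the missing-component check stands in for the cross references between complex resources and their parts.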


Resource  Description 

Regardless of its type or class, each resource description must be documented in detail.
An appropriate format for capturing each type of resource should be generated to formalize the
representation. The format contains a number of fields which describe the resource in terms of
selected design factors, interfaces, and components (if provided). Multiple fields are provided for
each design factor allowing specified values, measured values, etc., to be entered and maintained.
Design rationales and alternative resources are also listed for alternate designs. Figure 3 illustrates
an example format for a general purpose CPU type resource.


Figure 2. Example of a Resource Library Organization


Figure 3. Example format for a General Purpose One MIP CPU


Multi-level Function-Resource Mapping

The multi-level function-resource mapping provides a mechanism for allocating functions to
resource types and ultimately for mapping those functions indirectly to specific resources. The
mapping establishes a strong link between the logical architecture and the resource architecture
and requires that (1) every function be fully implemented in resources, and (2) every resource be
traceable back to a required function (or a derived system service task).
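
The two requirements lend themselves to a simple automated check. The sketch below assumes the mapping is kept as a dictionary from function (or service task) to the set of resources assigned to it; all names are hypothetical, and "fully implemented" is approximated here as "has at least one resource assigned".

```python
def check_mapping(functions, resources, mapping, service_tasks=()):
    """Return (unimplemented functions, untraceable resources).

    mapping: dict from a function or service task to its assigned resources.
    """
    allowed = set(functions) | set(service_tasks)
    # Requirement (1): every function has at least one resource.
    unimplemented = sorted(f for f in functions if not mapping.get(f))
    # Requirement (2): every resource traces to a function or service task.
    used = set()
    for item, assigned in mapping.items():
        if item in allowed:
            used |= set(assigned)
    untraceable = sorted(set(resources) - used)
    return unimplemented, untraceable


# Hypothetical functions, resources, and allocation.
functions = {"beamform", "detect", "display"}
resources = {"beamformer_hw", "cpu_1", "console", "spare_cpu"}
mapping = {
    "beamform": {"beamformer_hw"},
    "detect": {"cpu_1"},
    "operating_system": {"cpu_1"},   # derived system service task
}

result = check_mapping(functions, resources, mapping,
                       service_tasks={"operating_system"})
print(result)
```

Because the mapping process is iterative, a check of this kind would be rerun after each change rather than once at the end.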

Once a complete functional listing is established and a preliminary resource library exists,
the function-resource mapping can be performed in various layers or steps. The function-resource
mapping process is intended to be iterative and as such must accommodate numerous changes.
Although in the following discussion the mapping is shown in three layers, it can vary depending on
how much trade-off analysis systems engineers would like to perform. Opportunities exist at each
step or layer to optimize the mapping according to a variety of design factors. However, increasing
the number of layers of the mapping will affect the development cost; hence, system engineers should
have an appropriate mapping plan to suit their requirements. At each level of mapping, analysis
techniques can be applied to verify the correctness of the design. Figure 4 illustrates the three
level function-resource mapping of a passive sonar system.

The first layer of the function-resource mapping is an assignment of the system functions
to generalized resource classes. Each function is mapped to one of these generalized resource
classes which is then further refined by specifying a certain resource type from that class. For
example, the beamforming function is mapped to the general class of Programmable Hardware &
Software, which is further specified as special purpose custom beamformer hardware with
beamformer microcode software.

The  second  layer  of  the  function-resource  mapping  establishes  a  set  of  implementation 
tasks  which  will  be  performed  by  the  specific  resource  classes  or  types.  Many  of  these 
implementation  tasks  are  directly  related  to  the  functions  and  may  represent  an  implementation 
specific  functional  partitioning,  grouping  or  some  combination  of  the  two  which  reflects  the 
intended  implementation.  Other  implementation  tasks  are  created  to  provide  required 
implementation  specific  system  services  such  as  CPU  operating  systems,  database  managers  and 
network  executives.  This  layer  of  the  mapping  provides  the  systems  engineer  with  a  mechanism 
for  repartitioning  the  functional  decomposition  for  implementation  without  directly  modifying  the 
Functional  Capture  View. 
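
The first two layers can be sketched as a pair of small tables. In the example below only the beamforming entry comes from the text; the other functions, the task names, and the dictionary layout are assumptions made for illustration.

```python
# Layer 1: function -> (generalized resource class, refined resource type).
layer1 = {
    "beamform": ("Programmable Hardware & Software", "custom beamformer"),
    "detect": ("Programmable Hardware & Software", "general purpose CPU"),
    "classify": ("Programmable Hardware & Software", "general purpose CPU"),
}

# Layer 2: implementation task -> functions it implements.
# A task with no functions is an implementation specific system service.
layer2 = {
    "beamform_task": ["beamform"],
    "track_task": ["detect", "classify"],   # grouping of two functions
    "operating_system": [],                 # system service task
}


def task_resource_types(layer1, layer2):
    """Resource types implied for each task by the functions it groups."""
    return {task: {layer1[f][1] for f in funcs}
            for task, funcs in layer2.items()}


def unassigned_functions(layer1, layer2):
    """Functions from layer 1 that no implementation task picks up."""
    assigned = {f for funcs in layer2.values() for f in funcs}
    return sorted(set(layer1) - assigned)


types = task_resource_types(layer1, layer2)
print(types)
print(unassigned_functions(layer1, layer2))
```

Keeping the task table separate from the function table is what lets the grouping change without editing the Functional Capture View. A task whose functions imply more than one resource type would be a natural candidate for review, although the paper does not state such a rule.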

The third layer of function-resource mapping is the allocation of the implementation tasks to
specific resources. At this level the descriptions of the candidate resources are detailed enough
that a framework for the system design can be constructed. In this layer of the mapping
individual resources are identified by hardware unit number, software/humware task and human
operator for each implementation task. Many implementation tasks require both hardware and
software resources. The static allocation of implementation tasks to specific combinations of
hardware and software represents an oversimplification of the system's hardware-software
mapping. In a system with alternate program load options or in systems where software tasks are
dynamically allocated to hardware, this level of the mapping simply represents a particular instance
of the possible software-hardware mapping.


Figure 4. Three Levels of Function-Resource Mapping of the Passive Sonar Example


Resource Architecture


Resource architecture is the representation of system resources including their
interconnection and utilization. With the multi-level resource description, the resource
architectures can be constructed at any mapping level. The accuracy of the analysis is influenced
by the abstraction level of the resource descriptions. When the mapping reaches certain levels of
detail, the developed architectures will not only reflect the resources which will perform the
required system's functions, but also the resources that provide the system support functions (i.e.,
operating system software). In general, each architecture represents the network of the same type
of resources. In a computer-based system, three resource architectures (hardware, software,
and humware) are usually addressed. The following examples of the resource architectures reflect
the third level of mapping that was mentioned above.

The  hardware  architecture  includes  a  description  of  the  hardware  components  of  the 
system  and  their  interconnection.  Information  about  the  hardware  that  was  selected  in  the 
function-resource  mapping  is  put  together  in  the  form  of  diagrams  and  a  data  base.  The  diagrams 
show  the  locations,  components,  and  physical  interconnection  of  the  various  hardware  units.  In 
addition  to  the  fields  provided  in  the  resource  library,  the  hardware  architecture  data  base  includes 
information  on  the  selection  rationale  and  requirements  traceability  for  each  hardware  resource  in 
the  system,  the  installed  location  (both  in  physical  terms  and  in  terms  of  the  interconnection 
topology  with  other  resources),  and  a  description  of  any  messages  sent  and  received  by  the 
hardware  which  are  specific  to  that  hardware  and  not  due  to  the  software  running  on  that 
hardware  (i.e.,  messages  sent  or  received  by  the  software  which  runs  on  the  hardware  are 
described  in  the  software  architecture).  Additional  fields  are  also  provided  for  each  design  factor 
which  is  chosen  to  describe  the  resource.  This  allows  required  and  budgeted  values  of  interest  to 
be captured. Figure 5 shows a candidate hardware architecture of the passive sonar system.


Figure 5. Passive Sonar System Hardware Architecture

Similar to hardware architecture, the software architecture includes a description of the
various software modules of the system. Each module is described in terms of the processing and
algorithms implemented, selected design factors addressing throughput requirements, memory
requirements, etc., and a description of the messages sent and received by the module. The
software architecture description is also intended to contain other information typically included in
a Software Requirements Specification (SRS) as defined by the DOD-STD-2167A DID #DI-MCCR-
80025A. The software architecture can be represented using various graphical forms including a
listing of source code modules with calling relationships, and message flow between modules, etc.
The software architecture database contains the necessary information to construct these system
software representations and to support modeling of system performance. Figure 6a shows an
example format of the software architecture of the detection function and Figure 6b shows a
description of a software task within that architecture.
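
Because each module records the messages it sends and receives, the database can be screened for messages that have a sender but no receiver, or the reverse. This is a minimal sketch under the assumption that each module entry carries two sets of message names; the module and message names are hypothetical.

```python
def unmatched_messages(modules):
    """Return (sent but never received, received but never sent)."""
    sent, received = set(), set()
    for module in modules.values():
        sent |= set(module["sends"])
        received |= set(module["receives"])
    return sorted(sent - received), sorted(received - sent)


# Hypothetical software modules and their message lists.
modules = {
    "detector": {"sends": {"contact_report"}, "receives": {"beam_data"}},
    "tracker": {"sends": {"track_update"}, "receives": {"contact_report"}},
    "display": {"sends": set(), "receives": {"track_update", "operator_cmd"}},
}

orphans = unmatched_messages(modules)
print(orphans)
```

A message reported as unmatched is not necessarily an error. In this example the two unmatched inputs would plausibly originate in hardware or with an operator, which the paper assigns to the hardware and humware architectures rather than to the software architecture.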


Figure 6a. Software Architecture of the Detection Function


Figure 6b. Description of the Detection Software Task


Finally, the humware architecture describes the number and type of personnel (i.e., training
and experience levels) required to man the system under various operating conditions. In the case
when human function is a large part of the system under design, the organization chart is
considered as part of this architecture. Organization breakdown and information interchanges
between different departments are captured in a format similar to that described for the capture
of hardware or software. Figure 7 illustrates a human resource architecture of the passive sonar
system and an example description of one particular operator, the detection operator.


Conclusion 

The above discussion is an attempt to represent the implementation aspect of the system
and its relationship with the functional aspect. The issue here is not the preference of one notation
over another; rather, it is the need for a robust technique to describe system resources from a
detailed specification of a simple component to a high level abstraction of a complex part. With
the flexibility to maintain multiple design options and the ability to analyze these options, systems
engineers will have more confidence in their design. There is no doubt that the above
techniques should be automated; however, early commitment to certain CASE tools or technologies
can restrict the intention of the methods. With the intention to support various types of analysis,
the implementation representation will also face the problem of how to automatically transform its
captured information to different models of analysis.


Figure 7. Human Resource Architecture of the Passive Sonar System


ABSTRACT

The  capture  of  large  complex  real-time  system  designs  requires  organization  and  representation  of  a 
large  diverse  body  of  technical  information  and  data.  While  most  system  design  capture  or 
specification  techniques  address  the  external  interfaces  to  the  system  under  design,  no  formal, 
structured  approach  for  capturing  the  full  spectrum  of  external  and  environmental  factors  which 
impact  the  system  design  has  been  established.  This  paper  addresses  the  preliminary  definition  of 
an  Environmental  Capture  View  within  the  context  of  a  multi-domain  design  capture  and  analysis 
methodology.  The  elements  of  the  Environmental  Capture  View  are  intended  to  provide  a 
structured  representation  and  organization  of  the  following  types  of  information:  operational 
scenarios,  system  concept  of  operations,  environmental  conditions,  external  systems  and 
interfaces, test strategies, maintenance and logistics considerations, and other external factors
which  impact  the  system  under  design.  Candidate  methods  for  representing  and  organizing 
selected  elements  are  discussed  with  examples  along  with  suggestions  for  the  direction  of  future 
work  in  this  area. 


INTRODUCTION 

The  capture  of  large  complex  real-time  system  designs  requires  organization  and  representation  of  a 
large diverse body of technical information and data. Existing systems engineering tools and
methodologies  [1],  [2],  [3]  offer  multi-domain  approaches  to  representing  key  aspects  of  the 
system  design  such  as  systems  functions  and  data,  hardware  and  software  architecture.  While 
most  system  design  capture  or  specification  techniques  address  the  external  interfaces  to  the  system 
under  design,  no  formal,  structured  approach  for  capturing  the  full  spectrum  of  external  and 
environmental  factors  which  impact  the  system  design  has  been  established. 

This  paper  addresses  the  preliminary  definition  of  an  Environmental  Capture  View  within  the 
context  of  a  multi-domain  design  capture  and  analysis  methodology  as  described  by  N.D.  Hoang 
[4].  The  five  capture  views  defined  within  the  methodology  are  briefly  summarized  below  to 
provide  a  context  for  the  main  subject  of  this  paper  which  is  definition  of  the  Environmental 
Capture View. The essential elements of the Environmental Capture View are defined herein, and
several candidate methods for representing and organizing the information for selected elements of
the Environmental Capture View are proposed.


FIVE  VIEWS  OF  A  COMPLEX  SYSTEM 

The central element of the multi-domain design capture and analysis methodology is definition of
the multiple design domains or Capture Views of the system which address the principal system
design perspectives: (1) Environmental, (2) Informational, (3) Functional, (4) Behavioral, and (5)
Implementation. The five views partition the system design into logical segments corresponding to
key perspectives of the system design. The partitioning of the system design capture into these
five Capture Views has evolved beginning with the NAVSWC ASW Methods and Tools Group
[5] and has been described in its present form by N. D. Hoang [4]. The five Capture Views are
briefly summarized in Table 1.


Table 1

FIVE VIEWS OF A COMPLEX SYSTEM

DESIGN VIEW            VIEW OBJECTIVES                     DESIGN ELEMENTS

Environmental View     Establish Conditions and Events     • Environmental Conditions and
                         Constraining System Operations      Event Descriptions
                       Specify Performance MOEs and        • External System Descriptions
                         Conditions of Measurement         • System Initial Conditions
                       Characterize System Concept of      • Measures of Effectiveness
                         Operations

Informational View     Represent System Components         • Entity-Relationship Diagrams
                         in Abstract Terms                 • Attribute/Method Descriptions

Functional View        Define System Functions and         • Function/Data Flow Diagrams
                         Decompositions                    • Process Specifications
                       Specify Data Flow Requirements      • Data Dictionary

Behavioral View        Define System States and            • Control Flow Diagrams
                         Triggers                          • State Transition Diagrams
                       Specify System Behavior             • Control Specifications
                         Characteristics

Implementation View    Define the Physical Hardware,       • Hardware, Software and Human
                         Software and Human Resources        Resource Descriptions
                         Which Make up the System          • Performance Parameters and
                       Specify System Physical               Resource Characteristics
                         Interconnectivity                 • Function/Resource Mapping


An attempt to address all of the issues associated with these views simultaneously without a
structured methodology is a multi-dimensional problem of a magnitude which exceeds the capacity of
most if not all systems engineers. Each of these views provides key information concerning particular
aspects of the system under design. Taken individually the views allow the systems engineer to
partition the design and analysis of a proposed or existing system into manageable parts.

The capture approaches for these design domains or views share a common hierarchical structure
which supports management of the magnitude and complexity associated with a large system
design. Flat representations of complex system designs rapidly become unwieldy as the design
detail unfolds. A hierarchical structure allows the system views to be represented at various levels
of detail from a broad top level which encompasses the breadth and scope of the system and its
external interfaces to very low levels which describe the details of a particular segment of the
system design.

Each design view represents the system from a particular perspective and highlights different
aspects of the design; however, they are not independent. The views represent different aspects of
the same system and therefore must be consistent. The features of a particular view can directly or
indirectly impact, to a greater or lesser degree, the design in another view depending on how the
relationship between the views is specified.


SYSTEM  DESIGN  CAPTURE  AND  ANALYSIS  AUTOMATION 

Successful  employment  of  a  multi-domain  design  capture  and  analysis  methodology  in  supporting 
a  complex  system  design  is  largely  a  function  of  the  degree  of  mechanization  which  can  be 
achieved.  The  size  and  complexity  of  large  scale  advanced  computer  systems  render  manual 
application of any design process or method unusable. Considerable potential benefits can be
gained  from  automation  which  supports  a  disciplined  structured  capture  of  the  initial  iteration  of  a 
system  design  and  subsequent  editing  of  that  capture.  Further  significant  efficiencies  can  be  gained 
through  automated  consistency  and  completeness  checking  within  and  between  the  five  system 
views  which  represent  the  captured  design.  However,  in  light  of  the  overwhelming  systems 
engineering  task  represented  by  analysis  of  an  advanced  complex  system  design,  the  most 
significant  productivity  gains  are  in  automated  support  for  design  simulation  and  analysis  within  an 
integrated  and  highly  automated  design  capture  and  analysis  environment. 


DEFINING  THE  ENVIRONMENTAL  CAPTURE  VIEW 

The  Environmental  Capture  View  is  defined  as  the  structured  representation  and  organization  of  the 
following  types  of  information:  operational  scenarios,  concept  of  operations,  environmental 
conditions, external systems and interfaces, test strategies, maintenance and logistics
considerations, and other external factors which impact the system under design. The information
captured in the Environmental Capture View is necessary to address important issues of a system
under design but may not typically be included in the design itself. The key elements of the
Environmental  Capture  View  are  listed  below  with  a  brief  description.  The  following  sections 
address  each  of  these  elements  in  some  detail  including  selected  examples. 

Operational Scenarios describe situations and sequences of external events which the
system must address. The most stressing cases and most likely cases are identified and
described from the potentially large spectrum of possible operational scenarios.

The Concept of Operations describes the proposed approach for operation of the system
under design from the operator's perspective.

External Systems and Interfaces to the system are captured including information necessary
to develop an external system model to support system design simulation and testing.

The Environmental Conditions under which the system must operate may include
geographic, meteorologic, electromagnetic, and acoustic environmental factors as well as
others associated with the system's operational surroundings.

System  Test  Strategies  for  testing  system  compliance  with  the  top  level  system 
requirements  are  captured  including  system  performance  metrics,  system  test  approach  and 
test  procedures,  system  Sim/Stim,  and  test  instrumentation. 

Maintenance and Logistics considerations associated with the system through its life cycle
which impact the design are also captured.

Other  External  Factors  which  influence  and  affect  the  system  under  design  are  also 
documented  such  as  design  constraints  and  guidance  imposed  by  the  program  development 
sponsor. 


The  elements  of  the  Environmental  Capture  View  including  the  external  interfaces  and 
environmental conditions are described and captured in terms which are compatible with the
Informational,  Functional,  Behavioral,  and  Implementation  Capture  Views  such  that  the 
consistency  across  the  various  capture  views  can  be  analyzed.  These  descriptions  also  establish  a 
basis  for  development  and  analysis  of  a  performance  simulation  for  the  system  under  design. 
Simulation  provides  the  ability  to  identify  potential  shortfalls  and  errors  in  the  system  design  early 
in  the  design  cycle  when  changes  and  corrections  are  considerably  less  costly.  The  descriptions 
also  provide  important  information  needed  for  system  integration  and  testing  as  well  as  for 
development  of  system  external  interface  specifications. 

It  is  important  to  note  that  this  paper  does  not  address  a  particular  systems  engineering  design 
process. Instead, it describes a technique and methodology for capturing the Environmental
Capture View of a complex system design regardless of the systems engineering process model
employed. This paper describes the current evolution of the Environmental Capture View and
represents a snapshot of an ongoing effort to formalize the elements of the Environmental Capture
View and the techniques for documenting those elements. Examples extracted from a passive
sonar system sample problem [6] are employed in this paper to illustrate the capture techniques and
to describe the rationale for the methods employed.


OPERATIONAL  SCENARIOS 

This  element  of  the  Environmental  Capture  View  identifies  and  describes  the  operational  scenarios 
which  are  expected  to  be  encountered  by  the  system  under  design  both  in  the  near-term  and 
far-term  over  the  system’s  projected  life.  The  capture  of  operational  scenarios  is  intended  to 
establish bounds on the spectrum of possible operational scenarios, identify key scenarios which
represent  the  most  likely  and  most  stressing  cases  for  the  system  under  design,  and  describe 
selected  key  scenarios  in  detail.  These  key  scenarios  are  captured  in  sufficient  detail  to  establish 
test  cases  for  system  performance  simulation  and  analysis. 

Scenarios  captured  using  the  approach  described  in  this  section  could  be  maintained  in  a  scenario 
library  and  reused  for  multiple  system  designs  where  applicable.  An  example  of  where  this 
approach would be particularly useful is the set of seven scenarios promulgated recently by the
U.S.  military  Joint  Chiefs  of  Staff  (JCS).  This  set  of  seven  scenarios  has  been  approved  for  use 
in  establishing  system  development  guidance  and  provides  a  vehicle  for  evaluating  the  warfighting 
utility  of  existing  and  proposed  systems. 


SPECTRUM OF OPERATIONAL SCENARIOS


The first component to be addressed in capturing the operational scenarios is a mechanism for
bounding and describing the wide spectrum of possible operational scenarios for the system under
design. The set of possible scenarios is, in general, very large and cannot be efficiently
represented by a list of scenario cases. The approach we will employ here is to establish the key
parameters which characterize the scenarios and then to establish the range of possible cases for
each parameter. These parameters provide a mechanism to characterize entire classes of scenarios
and to identify the most likely and most stressing cases to support system simulation and analysis.

Operational scenarios can be characterized in many ways, such as the geographic setting,
environmental conditions, the objectives of the system users, the intensity of operations, etc. A set
of parameters must first be identified and defined which describe the essential elements of an
operational scenario for the system under design. These parameters should be selected to be
orthogonal if possible to allow any combination of the various parameters to be selected. A range
of values or cases for each scenario parameter is also identified. The goal is to establish a set of
scenario parameters and parameter values which can uniquely characterize any possible scenario
which the system under design may encounter.

A  simplified  set  of  scenario  parameters  for  the  passive  sonar  system  sample  problem  is  illustrated 
below.  Five  parameters  have  been  identified  to  describe  the  spectrum  of  operational  scenarios 
in which the sample passive sonar system is intended to operate and which can be used to uniquely
identify  a  particular  scenario: 

(1) Tempo of Operations - Defined as the level of operational intensity associated with the platform
mission and the system’s role in supporting that mission. The range of possible cases for the
Tempo of Operations scenario parameter is illustrated as follows:

Low  -  Independent  transit  in  international  waters 

Moderate  -  CVBF  screening  operations  in  open  ocean 

High  -  Threat  submarine  tracking  operations 


(2)  Level  of  Conflict  -  Defined  as  the  combat  readiness  or  alert  state  of  the  platform  and  related  to 
the  likelihood  of  hostile  actions.  The  range  of  possible  cases  for  the  Level  of  Conflict  scenario 
parameter  is  illustrated  as  follows: 

Peacetime 
Crisis  Response 
Transition  to  War 
Regional  Conflict 
Global  Conventional  War 


(3)  Environmental  Acoustic  Conditions  -  Defined  as  the  description  of  prevailing  environmental 
conditions  which  affect  acoustic  sensor  performance.  A  simplified  range  of  possible  cases  for  the 
Environmental  Acoustic  Conditions  scenario  parameter  is  illustrated  as  follows: 

High Ambient Noise/High Propagation Loss
High Ambient Noise/Moderate Propagation Loss
High Ambient Noise/Low Propagation Loss
Moderate Ambient Noise/High Propagation Loss
Moderate Ambient Noise/Moderate Propagation Loss
Moderate Ambient Noise/Low Propagation Loss
Low Ambient Noise/High Propagation Loss
Low Ambient Noise/Moderate Propagation Loss
Low Ambient Noise/Low Propagation Loss


(4) Acoustic Contact Density - Defined as the number of contacts within acoustic detection range
of the system. A simplified range of possible cases for the Acoustic Contact Density scenario
parameter is illustrated as follows:

Low - less than 3 simultaneous acoustic contacts
Moderate - 3 to 10 simultaneous acoustic contacts
High - 10 to 15 simultaneous acoustic contacts
Very High - greater than 15 simultaneous acoustic contacts


(5) System Operating Mode - Defined as the current readiness of the system. A simplified
range of possible cases for the System Operating Mode scenario parameter is illustrated as follows:

Full-up - All functional capabilities available

Degraded Mode - Only detection and limited tracking capabilities available


The scenario parameters and the ranges of possible cases or values for each scenario parameter are
then configured in a scenario matrix which allows the user to summarize the spectrum of possible
scenarios in a compact form. Individual scenarios can easily be extracted from the scenario matrix
by selecting one value or case from each of the scenario parameters in the scenario matrix. This
also creates a shorthand method for labeling scenarios.

Figure 1 Passive Sonar System Example Scenario Matrix

The scenario matrix for the passive sonar system is illustrated in Figure 1. Even in this simplified
example the number of possible scenarios described by this scenario matrix for the passive sonar
system totals 1080 (3x5x9x4x2). Describing 1080 scenarios in detail is neither a reasonable approach
nor a necessary one. The next two sections describe a method for selecting a small subset of
the possible scenarios which represent the most likely scenarios and the most stressing scenarios.
These few selected scenarios will then be described in detail and will form the basis for system
analysis and testing.
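
The count of 1080 and the shorthand labeling idea can be checked with a short sketch. The Python below is a minimal illustration, not part of the capture methodology: the dictionary layout, the abbreviated case names, and the hyphenated index label format are all assumptions made for this example.

```python
from itertools import product

# Scenario parameters and cases as listed in the text (labels abbreviated).
SCENARIO_MATRIX = {
    "tempo": ["Low", "Moderate", "High"],
    "conflict": ["Peacetime", "Crisis Response", "Transition to War",
                 "Regional Conflict", "Global Conventional War"],
    "acoustic": [f"{noise} AN/{loss} PL"
                 for noise in ("High", "Moderate", "Low")
                 for loss in ("High", "Moderate", "Low")],
    "density": ["Low", "Moderate", "High", "Very High"],
    "mode": ["Full-up", "Degraded"],
}


def all_scenarios(matrix):
    """Yield every scenario as a dict holding one case per parameter."""
    names = list(matrix)
    for combo in product(*(matrix[n] for n in names)):
        yield dict(zip(names, combo))


def label(scenario, matrix):
    """Shorthand label: 1-based index of the chosen case for each parameter.

    The hyphenated index format is an assumption made for this sketch.
    """
    return "-".join(str(matrix[n].index(scenario[n]) + 1) for n in matrix)


scenarios = list(all_scenarios(SCENARIO_MATRIX))
print(len(scenarios))                        # 3 * 5 * 9 * 4 * 2 combinations
print(label(scenarios[0], SCENARIO_MATRIX))  # label of the first combination
```

Because each parameter contributes exactly one case, every label maps back to a single scenario, which is what makes the matrix usable as a compact index into the full scenario space.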


MOST  LIKELY  SCENARIOS 

The  set  of  most  likely  (or  most  common)  scenarios  which  the  system  under  design  will  encounter 
is  derived  by  examining  the  spectrum  of  possible  scenarios  as  embodied  in  the  scenario  matrix. 
The  most  likely  range  for  each  of  the  scenario  factors  is  determined  and  then  the  most  likely 
combinations of the parameters are identified. Understanding the set of most likely scenarios
represents an important aspect in developing the system design. These scenarios also define one
set  of  test  cases  which  can  be  used  to  simulate  and  analyze  the  system  under  design. 

Figure 2 Scenario Matrix With Most Likely Cases Highlighted

Based  upon  combining  the  most  likely  scenario  parameters  identified  in  Figure  2,  the  following 
two scenarios represent the most likely scenarios for the sample sonar system:

Likely  Scenario  #1  -  A  moderate  tempo  of  operations  in  a  peacetime  situation  in  a  moderate 
acoustic  environment  with  moderate  contact  density  and  the  system  in  the  full-up  operating  mode. 

Likely Scenario #2 - A moderate tempo of operations in a crisis response situation in a moderate
acoustic environment with moderate contact density and the system in the full-up operating mode.


MOST  STRESSING  SCENARIOS 

The  set  of  most  stressing  scenarios  which  the  sample  passive  sonar  system  will  encounter  is  also 
derived by examining the spectrum of possible scenarios. The most stressing cases for each of the
scenario factors are determined and then the most stressing combinations of the five parameters are
identified.  Understanding  the  set  of  most  stressing  scenarios  represents  an  important  aspect  in 
developing  a  system  design  which  is  operable  and  meets  the  intent  of  the  system  requirements. 
These  scenarios  also  define  a  second  set  of  test  cases  which  can  be  used  to  simulate  and  analyze  the 
system  under  design. 

Figure 3 Scenario Matrix With Most Stressing Cases Highlighted


Based  upon  combining  the  most  stressing  scenario  parameters  identified  in  Figure  3,  the  following 
scenarios  represent  the  most  stressing  scenarios  for  the  sample  sonar  system: 

Stressing Scenario #1 - A high tempo of operations in a regional conflict situation in a high AN/PL
(ambient noise/propagation loss) acoustic environment with very high contact density and the
system in the full-up operating mode.

Stressing Scenario #2 - A high tempo of operations in a transition to war situation in a high AN/PL
acoustic environment with very high contact density and the system in the full-up operating mode.

Stressing Scenario #3 - A high tempo of operations in a regional conflict situation in a high AN/PL
acoustic environment with very high contact density and the system in the degraded operating
mode.

Stressing Scenario #4 - A high tempo of operations in a transition to war situation in a high AN/PL
acoustic environment with very high contact density and the system in the degraded operating
mode.
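
Selecting the most likely and most stressing scenarios amounts to restricting each parameter to its highlighted cases and combining what remains. A minimal Python sketch follows. The highlighted cases are read from the scenario descriptions in the text, since the figures themselves are not reproduced here, and the dictionary layout and the `combine` function name are assumptions for illustration only.

```python
from itertools import product

# Highlighted cases per parameter, taken from the scenario descriptions in the text.
MOST_LIKELY = {
    "tempo": ["Moderate"],
    "conflict": ["Peacetime", "Crisis Response"],
    "acoustic": ["Moderate"],
    "density": ["Moderate"],
    "mode": ["Full-up"],
}
MOST_STRESSING = {
    "tempo": ["High"],
    "conflict": ["Regional Conflict", "Transition to War"],
    "acoustic": ["High AN/PL"],
    "density": ["Very High"],
    "mode": ["Full-up", "Degraded"],
}


def combine(highlighted):
    """Return every combination of the highlighted cases as a list of dicts."""
    names = list(highlighted)
    return [dict(zip(names, combo))
            for combo in product(*(highlighted[n] for n in names))]


likely = combine(MOST_LIKELY)
stressing = combine(MOST_STRESSING)
print(len(likely), len(stressing))
for s in stressing:
    print(s["conflict"], "/", s["mode"])
```

Under these assumptions the two likely and four stressing scenarios fall out as the products of the highlighted cases, so the six test cases are selected from the full matrix without enumerating it by hand.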


SCENARIO  DESCRIPTIONS 

The  system  scenario  descriptions  are  captured  in  a  graphic  and  textual  format  which  is  tailored  to 
support  documentation  of  the  information  and  data  required  to  fully  describe  the  conditions,  events 
and  other  features  of  a  given  scenario.  The  following  outline  provided  in  Figure  4  represents  a 
preliminary  baseline  format  for  capturing  scenarios  associated  with  large  scale  military  systems. 
This  baseline  format  is  intended  to  be  tailored  for  use  in  specific  programs.  An  example  scenario 
description  is  currently  being  developed  using  this  approach  for  the  sample  sonar  system  described 
in  reference  5. 

Figure 4 Scenario Description Contents


CONCEPT  OF  OPERATIONS 

The  system  concept  of  operations  describes  a  high-level  philosophy  for  the  conduct  of  specific 
operations  employing  the  system  under  design.  During  the  early  phases  of  the  system 
development process, the concept of operations description is developed such that it applies to a
broad spectrum of possible system implementations and is not constrained by a specific candidate
system design, personnel manning plan, hardware/software implementation or existing system
operator-machine interfaces. The focus is on describing the concept for mission execution using
the system at a high level without regard to a specific physical system implementation. As the
system  design  is  developed  over  time,  the  level  of  detail  captured  in  the  concept  of  operations  can 
then  be  refined  to  address  implementation  specific  design  features. 

Complex  systems,  and  in  particular  military  systems,  often  are  designed  to  address  a  wide 
spectrum of operational scenarios and therefore may have a correspondingly robust concept of
operations. The concept of operations for a complex system may have several variations which are
a function of the particular type of scenario which is being addressed. The system concept of
operations as captured in the Environmental Capture View is not intended to address the entire
spectrum of possible operational scenarios but will be described for a particular scenario or class of
scenarios. In general, one basic concept of operations associated with a selected scenario (e.g., one
of  the  most  likely  or  most  stressing  cases)  will  be  captured  with  variations  to  describe  the  principal 
differences  in  the  operations  for  the  key  scenarios  under  consideration  (e.g.,  the  most-likely  and 
most-stressing  scenario  cases). 

The  technique  currently  identified  for  capturing  the  system  concept  of  operations  is  a  structured 
English text document augmented by time lines which capture control and display utilization and
operational sequence diagrams. The outline for the concept of operations document is provided in
Figure  5.  As  with  the  scenario  description  document,  the  format  is  intended  to  be  tailored  for  use 
in  specific  programs. 


1.0 Operator Machine Interface Configuration Description

2.0  System  Modes  of  Operation 

3.0  Operator  Relationships  and  Activities 

4.0  Control  and  Display  Utilization 

5.0  Operational  Sequence  Diagrams 

Figure  5  Concept  of  Operations  Document  Outline 


Additional  work  is  required  in  this  area  to  identify  alternative  formal  methods  for  capturing  system 
concepts  of  operations  which  may  be  more  compatible  with:  (1)  future  design  capture  automation 
techniques;  (2)  graphical  system  representation  and  display  methods;  and  (3)  requirements  and 
design rationale traceability techniques.


EXTERNAL  SYSTEMS  AND  INTERFACES 

This element of the Environmental Capture View identifies and describes the external systems
which have some relationship to the system under design as well as the external interfaces from
these external systems to the system under design. The external interfaces between the system
under design and the natural environment (e.g., temperature and pressure transducers) are also
described.  The  principal  purpose  in  describing  the  external  systems  is  to  provide  a  means  for 
simulating  their  behavior  and  modeling  their  interfaces  to  the  system  under  design.  This 
information  is  necessary  to  create  an  external  model  which  can  be  used  for  the  purpose  of  modeling 
performance  of  candidate  design  implementations  during  early  design  phases  or  for  interface 
stimulation  during  testing  of  the  system. 

This element of the Environmental Capture View also summarizes the system boundaries and the
external interfaces to the system, and describes how the system under design fits into the larger
architecture and organization of the higher entity of which the system is a part. It does not address
the internal structure of the system, but serves to define the system’s interfaces and relationships to
other systems and activities which are considered outside the scope of the system under design. It
also describes and defines the objects external to the system and the behavior of those objects in
relationship to the system under design.

Information  captured  concerning  those  systems  which  are  external  to  the  system  under  design  is 
driven  by  a  spectrum  of  activities  from  providing  a  means  for  simulating  external  system  behavior 
in  support  of  modeling  the  performance  of  candidate  system  designs,  through  simulating  and 
stimulating  external  interfaces  during  testing  of  the  actual  system  during  integration.  The 
complexity  of  an  external  model  developed  as  a  “simulation  harness”  is  driven  by  the  level  of 
fidelity  required  to  stimulate  the  internal  model  of  the  system  under  design.  The  external  model 
should  support  the  system  design  process  from  an  early  top  level  representation  to  a  very  detailed 
level  of  design.  The  capture  techniques  employed  to  represent  the  external  systems  must  therefore 
be  flexible  and  support  evolution  from  an  abstract  level  to  a  very  detailed  level  of  fidelity.  The 
capture  approach  should  also  be  compatible  with  the  approach  employed  to  capture  the  system 
under  design  to  ensure  that  internal  and  external  simulation  models  created  from  design  capture 
activities  are  easily  integrated. 
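
The role of the external model as a simulation harness can be sketched in a few lines. The Python below is only an illustration under stated assumptions: the `ExternalSystemModel` class, the event tuple layout, the interface names, and the payload fields are all hypothetical and do not represent the Implementation Capture View or any existing tool. It shows external systems reduced to the events they present at the system boundary, with the system under design treated as a callable that consumes those events.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# (time in seconds, interface name, payload); the layout is hypothetical.
Event = Tuple[float, str, dict]


@dataclass
class ExternalSystemModel:
    """Stand-in for one external system, reduced to the events it emits."""
    interface: str
    period_s: float
    make_payload: Callable[[float], dict]

    def events(self, duration_s: float) -> Iterable[Event]:
        t = 0.0
        while t < duration_s:
            yield (t, self.interface, self.make_payload(t))
            t += self.period_s


def run_harness(externals: List[ExternalSystemModel],
                system_under_design: Callable[[Event], None],
                duration_s: float) -> int:
    """Feed time-ordered external events to the internal model; return the count."""
    timeline = sorted(
        (e for ext in externals for e in ext.events(duration_s)),
        key=lambda e: e[0],
    )
    for event in timeline:
        system_under_design(event)
    return len(timeline)


received = []  # here the "system under design" simply records what it is sent
nav = ExternalSystemModel("navigation", 1.0, lambda t: {"heading_deg": 90.0})
env = ExternalSystemModel("sea_temperature", 2.5, lambda t: {"temp_c": 12.0})
count = run_harness([nav, env], received.append, duration_s=5.0)
print(count)
```

In this arrangement, raising the fidelity of an external system means replacing its `make_payload` function with a richer model while the harness and the interface to the internal model stay unchanged, which mirrors the evolution from an abstract to a detailed representation described above.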

Based upon the foregoing considerations, an obvious candidate approach for capturing external
systems  is  to  employ  the  same  techniques  used  to  capture  the  system  under  design  summarized 
earlier  in  this  paper.  While  this  would  provide  the  extensibility  and  compatibility  features  needed, 
it  would  also  be  costly  to  create  the  five  capture  views  for  each  external  system.  This  reasoning 
has  led  to  tentative  selection  of  the  Implementation  Capture  View  as  the  basis  for  capture  of 
external  systems.  It  is  important  to  note  that  the  external  interfaces  of  the  system  under  design 
(which  are  considered  part  of  the  system  under  design)  are  intended  to  be  captured  using  the  full 
five  capture  views.  Further  investigation  is  required  to  determine  if  an  adequate  external  model 
representing  the  external  systems  can  be  constructed  from  the  information  contained  in  the 
Implementation  Capture  View. 

The  current  method  for  documenting  the  Implementation  Capture  View  is  described  by  N.D. 
Hoang  [7].  This  approach  for  capturing  external  systems  would  potentially  provide  the  benefits  of 
using  a  common  approach  for  design  capture  and  modeling  at  an  acceptable  cost.  Some  extensions 
and  modifications  to  the  Implementation  Capture  View  as  currently  envisioned  may  be  required  and 
will be identified as this work continues. An example capture of external systems using this
technique is currently being developed for the sample sonar system described
by Karangelen and Hoang [6].


ENVIRONMENTAL  CONDITIONS,  SYSTEM  TEST  STRATEGY,  AND  SYSTEM 
MAINTENANCE AND LOGISTICS CONSIDERATIONS

These  three  elements  of  the  Environmental  Capture  View  include  additional  key  information  which 
can  have  considerable  impact  on  the  system  under  design.  The  environmental  conditions 
represented by prevailing meteorological, electromagnetic, and acoustic conditions are often a key
factor in the performance of sensors such as radar and sonar, and in the performance of
communications systems. The system test strategy and system maintenance and logistics
considerations can typically impose design constraints which must be identified early in the design
process to avoid the potentially high cost of back-fitting test and maintenance capabilities in the late
stages of system integration and testing. To date no formal method for the capture of these
elements of the Environmental Capture View has been established. Existing techniques for
development and capture of system test strategy and system maintenance and logistics
considerations are currently being reviewed. No formal capture techniques for environmental
conditions in general have been identified to date; however, considerable data is available from a
variety of sources for specific classes of systems such as active and passive acoustic underwater
sensors, as well as land and air based radar. One potential method for organizing and capturing the
environmental data is based on categorizing the various environmentally dependent technologies
and establishing a format for each specific technology area (e.g., radar, sonar, etc.).


FUTURE  WORK 

This paper presents a preliminary description of an Environmental Capture View within the
context of a multi-domain design capture and analysis methodology. Further refinement of the
techniques for characterization and selection of operational scenarios, description of system
concept of operations, and capture of external systems and interfaces is required to: (1) address the
employment of automated design capture and simulation methods, (2) enhance compatibility of the
Environmental Capture View with other design views, and (3) provide a mechanism for
traceability of system requirements and design rationale. Capture techniques have yet to be
specified for the environmental conditions, system test strategy, and system maintenance and logistics
elements of the Environmental Capture View.


1 Introduction

A primary concern in the development of large-scale, real-time, complex, computer-intensive
systems is ensuring that the performance of the system meets the specified requirements.
As part of the system development and maintenance process, many decisions and trade-offs
are made that affect a variety of components of the system. Further, the requirements
themselves evolve and undergo many changes during the development process. In such a
context, it is essential to maintain traceability of requirements to various outputs or artifacts
produced during the systems design process, to ensure that the system meets the current
set of requirements. Maintaining consistency between the requirements and the design is
especially critical in situations where an organization relies upon outside contractors for
developing systems. Having a systematic way of validating that every requirement is met by
the design is important, not only to ensure that the system performs correctly, but also to
determine whether contractual obligations have been met.

It should be noted that throughout this paper, the term design is used to refer to any
activity that leads to the creation of artifacts. Potts and Bruns [1] note that even the early
phases of the systems development process involve the creation of intermediate artifacts and
the term design could be used to denote such activities as well.

A comprehensive scheme for maintaining traceability, especially for complex, real-time
systems, requires that all system components (not just software), created at various stages of
the development process, be linked to the requirements. These components include hardware,
software, humanware, manuals, policies and procedures. In order to achieve this objective,
it is essential that traceability be maintained through various phases of the systems
development process, from the requirements as stated (or contracted) by the customer, through
analysis, design, implementation and testing to the final product.

In the next section, we discuss past research and current tools for traceability. Next,
we discuss our approach towards developing a model of traceability. We describe an initial
empirical study and preliminary results that are being used in the design of future work.
Finally, we present an example of a complex traceability relationship based on a model of
(design) rationale in requirements engineering and address some of the issues raised in our
study.


2  Background 

2.1  Definition  of  Traceability 

A  variety  of  definitions  of  traceability  have  been  proposed  in  the  literature  depending  on 
the intended use of traceability information. Greenspan and McGowan [2] provide a generic
definition of traceability: traceability as a property of a system description technique that
allows changes in one of the three system descriptions (requirements, design specifications,
implementation) to be traced to the corresponding portions of the other descriptions. The
correspondence should be maintained throughout the lifetime of the system.

Schneidewind defines traceability as the ability to identify the technical information which
pertains to a software error which has been detected during the maintenance phase and
thereby trace the error to the applicable design specifications and user requirements [3], [4].

The need to provide traceability is recognized in most critical standards governing the
development of software for the U.S. Government. For instance, DoD-STD-2167A specifies
that “the contractor shall document the traceability of the requirements allocated from the
system specification to each Computer Software Configuration Item (CSCI), its Computer
Software Units (CSUs), and from the CSU level to the Software Requirements Specifications
(SRSs) and Interface Requirements Specifications” [5],[6]. An elaboration of this
requirement states that “the Software Design Document describes the allocation of requirements
from a CSCI to its CSCs and CSUs” [6].

It should be noted that even this elaboration is not specific about the nature of the
linkages to be maintained and leaves the interpretation of the meanings of such linkages
to the users. Unless the semantics of the linkages are clearly specified, the existence of a
link between a CSCI and its CSCs could denote one of several possibilities, including: the
requirements have been completely allocated, some of the CSCs satisfy some aspects of the
requirements, or it is possible to verify that the requirements have been completely satisfied.

Many of the above definitions are geared toward maintaining traceability in software
components. Our goal is to develop a model that will address traceability issues at the level
of systems design, relating requirements to all system components. Further, such a model
should not only discuss the kinds of linkages or relationships that should be maintained, but
also the reasoning that can be performed with such traceability information.

2.2  Why  Traceability? 

As discussed earlier, requirements traceability is imperative to ensure the closure of all
systems components [7].

As requirements tend to evolve over the long life of large scale systems, maintaining
systems to meet such evolving needs is critical. As large scale systems are composed of
interdependent components and as the design process "spreads information", even small
changes at the level of requirements may lead to major changes to various parts of the system.
Requirements traceability will go a long way in alleviating the problem of maintenance
by facilitating the identification of interdependencies among components and localizing the
effects of changes made at various levels of systems design. Further, if the relationships
between design and the requirements can be maintained, any change to the design can be
analyzed to determine if the system still meets every requirement. Since every requirement
that is affected by a part of the design can be identified, the side effects of changes can be
contained and avoided.

Arguing for the importance of traceability, Choi and Scacchi state that the correctness
of a configured software description (i.e., a software system's life cycle descriptions) can be
formalized in terms of their consistency, completeness and traceability [8].

Traceability is required irrespective of the software design methodology or the hardware-software
architectures used in a project. Some methodologies provide tight relationships
between components produced during various stages of the design process and hence
automatically provide traceability. An extreme example is the development of software based on
formal specifications. The formal requirements are transformed into executable systems. The
transformation history provides traceability between formal requirements and the executable
system. Automated code generation using a fourth generation language is an example of such
'automatic' traceability which is limited in scope.

A recent workshop on reuse in practice concluded that an environment to facilitate reuse
must support automatic traceability of a component through the requirements to the
executable components. Traceability is important for user understanding of the component's
design and implementation, since it captures the context and the constraints of the
development process, and this understanding assists the user of a component in reusing it in
another situation [9].

3  Traceability  Tools 

Initial work on traceability concentrated on providing document traceability. Document
traceability determines the existence of relationships between two document components
[10], [11].

ARTS [12] is among the earliest systems to capture and use traceability information.
The current commercial state-of-the-art requirements traceability techniques (i.e., those
employed in tools such as Teamwork/RQT, R-trace, RDD-100) simply link requirements to pieces
of the design and implementation. Current tools, similar to ARTS, tend to focus on the database
management issues related to maintaining links between requirements and various components
of the system. An area that is not adequately addressed by current approaches is the
capture and use of the semantics of the relationships. For instance, these techniques do not
address the issue of representing how the requirement is satisfied by the design, but just
facilitate capturing the fact that some relationship exists.
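
To make the distinction concrete, the following minimal Python sketch contrasts a bare link,
which records only that some relationship exists, with a link that also carries its semantics.
The class name TraceLink, the fields kind and how, and the sample identifiers are illustrative
assumptions; they are not features of ARTS or of any of the commercial tools named above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceLink:
    source: str                  # hypothetical requirement identifier
    target: str                  # hypothetical design element identifier
    kind: Optional[str] = None   # semantics of the link, e.g. "satisfies"
    how: Optional[str] = None    # explanation or evidence for the link


def describe(link: TraceLink) -> str:
    """Render what a reader can conclude from a link."""
    if link.kind is None:
        # A bare link: only the existence of a relationship is known.
        return f"{link.source} is related to {link.target} (meaning unspecified)"
    text = f"{link.target} {link.kind} {link.source}"
    return f"{text} ({link.how})" if link.how else text


bare = TraceLink("REQ-7", "DFD-3")
typed = TraceLink("REQ-7", "DFD-3", kind="partially satisfies",
                  how="covers order entry; billing not yet designed")

print(describe(bare))
print(describe(typed))
```

With only the bare link, a maintainer cannot tell whether DFD-3 fully satisfies, partially
satisfies, or merely mentions REQ-7; the typed link records that distinction explicitly.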

Another shortfall with today's traceability tools is that they lack the ability to trace
back from the actual pieces of design and implementation to the requirements. Although
some tools such as Teamwork/RQT and R-trace allow the user to trace from requirements
analysis tools such as Teamwork and Software Through Pictures, they do not have a method
for tracing from a particular piece of hardware or humanware back to the requirements. This
capability would be extremely useful in performing systems maintenance.

4 Towards a Model of Traceability

The need for better understanding of traceability is widely recognized. As discussed
earlier, maintenance of traceability is also mandated by several standards.
However, the precise definition of the kinds of traceability linkages or relationships that must
be maintained is currently lacking. A major challenge in this research is the development
of a model that represents and provides the semantics of various traceability linkages or
relationships between requirements and system components.

There are a variety of stakeholders involved in the systems development process, including
project sponsors, project managers, analysts, designers, maintenance personnel, testing
personnel, and end users. A basic premise in our research is that the development of a
model of traceability could be geared towards the needs of these various stakeholders at
various stages in the systems development process. The first phase of our approach to this
problem has been an empirical one. We have conducted an initial empirical study to explore
the traceability needs of various stakeholders. The results of this study are being used in
designing a comprehensive study involving stakeholders in large scale, complex, real-time
systems development efforts. In this paper, we present the details of the initial study and
some preliminary findings which will be explored in the follow-up studies.

Study Design  Our data collection strategy in the initial study involved a two-pronged
approach: focus group interviews for idea generation and evaluation, and protocol analysis of
problem solving behavior.

Subjects 

The subjects in the study came from a Masters program in Information Technology at the
Naval Postgraduate School. The study was conducted after the students had completed the
analysis, design and implementation of an information system based on a case study. The
case study was developed based on a real-life large scale project and had been successfully
used in similar studies [13]. The case analysis involved a variety of data gathering methods
during the analysis phase including informal descriptions of user needs, simulated client
meetings, and actual documents from real-life situations. The major outputs developed by
the participants included requirements statements, data flow diagrams, entity-relationship
diagrams, database design and implementation. These activities were completed during a
period of over two months prior to the subjects' participation in the focus groups. Many
subjects had extensive experience in domains other than computer based systems development
such as ship building and aviation maintenance where concepts of traceability are widely
used.

Task 

The case study was in the domain of customer order processing in a utility company.
The problem was chosen for several reasons:

1. The case study has been developed based on data from an extensive domain analysis.
The domain analysis was based on a real life system developed by a large information
systems consulting organization.

2. The case study has been used successfully in several settings including protocol analysis
of group problem solving behavior.


3. The problem domain is familiar to the subjects as they have had personal experiences
with the services provided by the system.

4. Real life data could be easily collected from a utility company and used in the analysis
and design of the system when necessary (e.g., rate schedules were collected from the
local utility company and used in systems design).

5. The problem is sufficiently complex to cover all the basic elements of systems design.

6. The problem could be partitioned so that different groups of students could be assigned
projects that could be completed within a reasonable time frame.

Focus  Groups 

Focus group interviewing is among the most frequently used qualitative data
collection techniques. Focus group interviews are widely used in several domains including
market research. One of the major purposes of focus groups is idea generation.

Setting  Focus groups were conducted in a relatively formal setting - a group meeting
room equipped with facilities for audio/video recording. Each focus group consisted of up to 10
panelists/students and the following steps were involved in each session:

• A short warm-up period during which everyone, including the moderator, was introduced
and the ground rules of the interview were stated.

• This was followed by a predisposition discussion about the contexts in which the
traceability issues needed to be explored. This included general discussions on the
stakeholder interests in traceability.

• Relevant material on traceability from current vendor announcements and research
briefs was presented. These provided the basis for further discussions on their strengths
and weaknesses as well as modifications/extensions needed on current approaches.

• After all material had been discussed, a collective and comparative discussion of all
topics was conducted. This was followed by a wrap-up of the discussion. During
the wrap-up session, the participants were prodded for their summaries of what was
discussed in the group meeting.

Protocol Analysis  A study of the problem solving behavior of subjects engaged in a
traceability exercise was conducted. The primary source of data was the verbal protocols
of subjects. The verbal protocols provide a trace of the thinking process in arriving at a
solution. The subjects in this exercise were required to identify traceability information that
could be incorporated in their projects to satisfy various stakeholders.

Future Research

The major purpose of the above mentioned studies is to provide the basis for conducting
an empirical study in real-life systems development environments. The outcomes of the
current studies will help in the design of questionnaires as well as the design of focus groups and
structured interviews with "real" stakeholders in systems development (designers, analysts,
end users etc.).

Comparison of the two primary data collection strategies provides some interesting
insights into the appropriate research methodology for future work. Focus groups provided
surprisingly interesting results. In an exploratory data collection method, the researcher's
biases do not constrain the participants. In our study, for instance, many participants
related concepts of requirements traceability to their experiences in ship-building and aircraft
maintenance which employ similar concepts. Focus groups conducted with participants who
have real-life systems development experience are likely to provide very valuable sources of
information. As the participants are not restricted by the researcher's ideas and predisposition,
this methodology will often provide new perspectives and approaches to the problem
being explored. Protocol analysis, on the other hand, is likely to provide detailed information
on a problem solving task in which the participant has sufficient knowledge. However,
it is extremely expensive in terms of demands on the subjects and the researcher and may
involve extensive work in study design. Therefore, the use of this methodology should be
restricted to a very small number of subjects.


4.1  Issues  in  the  development  of  a  model  of  Traceability 

In the following section, we discuss some preliminary findings that would help develop a
model of traceability and mechanisms to support capturing and reasoning with this
information. These findings suggest that several areas need to be addressed by future research.
Several examples drawn from prior research have been included to elaborate the major issues.

• Different stakeholders are likely to have different uses for a given traceability linkage
or relationship between system components. Further, they may also need different
types of traceability linkages between the same systems components. For instance, the
traceability linkages between a requirement and an implementation may denote that the
implementation satisfies the requirement. The end user may be primarily interested in
using this information in ascertaining that the system meets his or her requirement
whereas the testing personnel may be interested in deriving this linkage from the
linkages between requirements and test procedures (e.g., test) and the relationships
between implementation and test procedures that 'validate' the implementation. If
the traceability information can be used to verify whether the tests were valid and
comprehensive and that the tests fully validate that the implementation meets all the
test criteria, then the implementation can be thought to 'satisfy' requirements.

The above perspective elaborates on traceability as a measure of quality of the system.
Quality, as viewed by the customer, is the degree to which the product complies with
their needs [14]. From a project manager's perspective, a major purpose of quality
assurance is to ensure that "projects are proceeding on schedule, within budget and in
a traceable manner, and in accordance with customer requirements and performance
criteria" [15]. The linkage 'satisfies' as defined above to satisfy an end-user is unlikely
to be of much interest to the program manager who cannot wait until the testing
phase, but should ascertain whether intermediate components such as designs meet
the requirements. It is obvious that the concept of quality of a system will be different
from the perspectives of different stakeholders. Various 'ilities' define quality from the
perspectives of different stakeholders; hence the need for different types of traceability
information.

• The design process spreads information; i.e., several components may be necessary to
satisfy a requirement. As the system evolves over its development cycle, it is desirable
to identify design or implementation elements that 'partially satisfy' a given requirement.
For instance, a hardware-software combination is often necessary to satisfy
a given requirement. When either the hardware or software component is developed,
traceability information should reflect the fact that it partially satisfies the requirement.
Such information can be used in ensuring that the partially satisfied requirements are
fully satisfied by performing necessary actions.

• A corollary to the above is that it should be possible to identify a combination of
design elements that 'satisfy' a requirement or are 'generated by' a requirement.

An example of such a traceability scheme is the use of AND-OR graphs to represent
traceability linkages. AND-OR graphs can be used to model a task in terms of a series
of goals and subgoals. Figure 1 illustrates such a complex linkage.


Figure 1: An example of Complex Linking

• A traceability scheme should recognize that all requirements are not equal in terms
of levels of importance or significance or criticality. It may be unnecessary or even
undesirable (considering the overhead involved in maintaining traceability) to maintain
linkages between every requirement and every artifact created during the systems design
process that is related to the requirement. It is essential to identify critical requirements
and maintain traceability linkages from those requirements to system components.

• A useful way of identifying the critical requirements is to relate them to the central
'mission' of the system. The business processes that generate requirements should be
identified and requirements evaluated with respect to them to arrive at a classification.
I.e., traceability should address the issue of how the requirements are arrived at. This
necessitates a mechanism to represent the elaboration and refinement of requirements
from the central mission or business processes that generate requirements.

• A traceability scheme should allow the linkages to be qualified to denote whether the
link can be verified formally or informally. Consider the link from a requirement to
a design object that satisfies the requirement. In a complex, large, real-time system,
ascertaining whether some requirements have been satisfied can be done only qualitatively.
It is often impractical or impossible to comprehensively and formally test such
requirements. This is especially true of "generic" requirements, an example of which is
"the system shall provide a user friendly interface". A link from such a requirement to
a design or implementation component should identify not only whether the requirement
is satisfied, but also "how". This information can be in the form of a link to a test
procedure or specification or a qualitative evaluation by the user. From a maintenance
standpoint, such qualitatively satisfied requirements may need periodic examination to
ensure that they are still valid with changing requirements and user communities.

• Though one of the most critical uses of traceability is ensuring that a design element
satisfies a requirement, the existence of such a link may not answer the question: are
the functionalities of the design element required by requirements? As a part of the
validation and verification process, such a question should be answered to ensure that
there are no unnecessary functionalities in a system component that are not driven by
user needs; i.e., the links should be bidirectional to allow requirements tracing forward,
from requirements to system components, and backward, from system components to
requirements.

• As systems requirements evolve over the lifecycle, it will be beneficial to assess the
impact of changing requirements. If the design segments derived from requirements
and the implementation that is generated by design segments are readily identified, the
project manager will be able to make an informed decision about the effort needed to
implement the required changes. A traceability scheme may provide both qualitative
(e.g., the criticality of modules affected) as well as quantitative (e.g., the number of
system components, code segments affected) information to aid decision making.

• An important component of traceability information is design rationale. Rationale
specifies the why of decisions and trade-offs made throughout the systems development
process. This information will be of interest to a variety of stakeholders who are
interested in understanding, modifying and communicating decisions made throughout
the system development process. Further, this information will be extremely useful in
change management in the contexts of evolving requirements and assumptions. The
rationale for systems components could be explicitly or implicitly specified by requirements
or could result from design decisions.

Hamilton and Beeby [16] define traceability as the ability to discover the history of
every feature of a system. Design rationale is an important component of such a
history. Brown [17] states that it should be possible to identify the requirement or
design decision from which a product was derived. Design rationale identifies not only
the decisions, but also the reasons behind them.

• Traceability information can be used in project tracking and management. Traceability
links between various components of the system may include information used in
project management (such as completion date, status, personnel assignment). Such an
integration of project management as a part of the systems development process will
ensure that timely and accurate information is available for critical project management
tasks.

• As complex systems are composed of interdependent components, such dependencies
should be represented and maintained. Often the inter-component dependencies are
not well understood and documented.

Systems design is a complex activity involving interdependent decisions. In the absence
of mechanisms to record such dependencies, over time and with changing development
teams, this information will be lost. Such dependencies may span across different
system components. A decision about software may be dependent on an earlier decision
about hardware. For instance, a hardware decision to use SUN SPARCstations as the
hardware platform may lead to a software decision that uses SunOS as the operating
system. As the system evolves over its life cycle, the hardware decision may get changed
leading to inconsistency with the software that was based on the earlier hardware
decision. Unless the dependencies are captured and maintained, such issues may go
undetected leading to severe system integration problems. Our model will provide
mechanisms to represent and reason with the dependencies among design decisions.

• A major use of traceability is the identification and assignment of accountability.
Examples of such linkages include system components designed by, system components
tested by or system components validated/verified by or modified by development
personnel. Maintenance of accountability information will facilitate communication,
coordination and maintenance of a system. It is especially important to maintain this
information in mission critical areas of the system. An analogy is the similar
information maintained in aircraft construction and maintenance.

• A comprehensive mechanism for traceability should link the "humanware" component
of a system to the other components of the system. Examples of such linkages include
system functionalities performed by humans. This information is necessary to ensure
that the allocation decisions are complete and correct.

• Traceability linkages (e.g., explained by) to systems components such as manuals,
policies and procedures that specify how to obtain a required performance from a system
component are as important as the information about the 'why' of the design process.

• Automated support for traceability is extremely important given the volume and the
complexity of the task. Traceability information should be captured as a part of the
systems development process automatically when possible. It is especially desirable and
convenient to do so in situations where components are related by derived from or
equivalent of relationships (such as a decomposition; e.g., data flow diagrams and
structure charts).
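
The AND-OR linkage and the notion of partial satisfaction discussed in the list above can be
sketched as follows. This is a minimal illustration under stated assumptions, not the scheme of
Figure 1: the Node class, the three status labels, and the rule that an AND node is 'partial'
when only some of its children are satisfied are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field
from typing import List

SATISFIED, PARTIAL, UNSATISFIED = "satisfied", "partial", "unsatisfied"


@dataclass
class Node:
    name: str
    op: str = "LEAF"      # "LEAF" (a design element), "AND", or "OR"
    done: bool = False    # for leaves: has the element been developed?
    children: List["Node"] = field(default_factory=list)


def status(node: Node) -> str:
    """Evaluate how far the linkage rooted at `node` is satisfied."""
    if node.op == "LEAF":
        return SATISFIED if node.done else UNSATISFIED
    results = [status(child) for child in node.children]
    if node.op == "AND":
        # Every child is needed; some but not all counts as partial.
        if all(r == SATISFIED for r in results):
            return SATISFIED
        if all(r == UNSATISFIED for r in results):
            return UNSATISFIED
        return PARTIAL
    if node.op == "OR":
        # Any one fully satisfied alternative is enough.
        if SATISFIED in results:
            return SATISFIED
        return PARTIAL if PARTIAL in results else UNSATISFIED
    raise ValueError(f"unknown node type: {node.op}")


# A requirement that needs a hardware element AND one of two software options.
hardware = Node("controller board", done=True)
driver_a = Node("driver, option A")
driver_b = Node("driver, option B")
requirement = Node("REQ-1", "AND",
                   children=[hardware, Node("driver", "OR", children=[driver_a, driver_b])])

before = status(requirement)   # hardware exists, no driver yet
driver_b.done = True
after = status(requirement)    # one driver alternative now exists
print(before, "->", after)
```

In this sketch the requirement is reported as partially satisfied while only the hardware
exists, and as satisfied once either software alternative is developed, which mirrors the
hardware-software example given in the list.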


5  Design  Rationale  as  an  Example  of  Traceability 

A conceptual model and mechanisms for the representation of and reasoning with process
knowledge (i.e., design rationale) have been developed in earlier research as a part of the
REMAP (Representation and MAintenance of Process Knowledge) project. The model
and the mechanisms provided by REMAP for representing and reasoning with traceability
information to support various stakeholders are discussed in detail elsewhere [18]. This design
rationale model can be viewed as an instance of a traceability link between a requirement
and a design element. The term "design element" denotes any part of the system design
or implementation (i.e., data flow diagrams, specifications, pieces of hardware, humanware
etc.). In this section, we discuss how such a model and reasoning mechanisms can be used
in the context of the issues discussed in the previous section.

• support for various stakeholders: There are a variety of stakeholders involved in large
software projects, each having a different set of goals and priorities. For each of the
stakeholders, some useful support can be provided by recording, in some structured
manner, the history of a design in the form of (design) rationale.

• partially satisfied requirements: The process of satisfying requirements may generate
several issues that need to be resolved. Resolution of issues leads to one or more design
components. Partially satisfied requirements may be identified with unresolved issues
that relate to that requirement using structures like the AND-OR graphs in REMAP.
A similar structure can be used in linking design artifacts to requirements through
design decisions.

• criticality of requirements: Our model captures the elaboration and refinement of
requirements. Critical 'mission statements' or core 'business process' objectives can be
the origin of such an elaboration and refinement. During this process, the criticality
or importance of requirements can be ascertained and monitored. The REMAP
model can represent this information as an attribute of the links between mission
statements/business processes and requirements or as attributes of requirements themselves.
Then, the critical requirements can be monitored to ascertain whether all the issues
related to them are resolved in a timely manner.

• qualitative and quantitative reasoning: The strength or other characteristics of
relationships can be either qualitative or quantitative. In REMAP, the contents of the
primitives can be informal information (such as text). But the model has well defined
semantics of relationships among its primitives, facilitating reasoning with this structure.
For instance, the assumptions in a design situation can be given different degrees
of belief (or validity), and these beliefs can be automatically propagated to beliefs in
arguments, positions and so on. Further, the strengths can be either qualitative or
quantitative.

• change management: In REMAP, changes to design rationale will automatically trigger
changes in the belief status (or validity) of design solutions thereby suggesting redesign
[18]. Since various components of the process knowledge that lead to the design solution
are tightly related, changes to the constraint set resulting out of changed assumptions,
decisions or requirements will initiate the synthesis of a new design solution and provide
rich information to estimate the effort involved in redesign.

• project management: REMAP provides facilities for representing and reasoning with
temporal information which can be useful for project management. For instance, a
validity time can be assigned to issues which could be interpreted as the time frame
during which that issue must be resolved. Then, this information can be used for
generating reminders to the designers or managers to focus their attention on issues
that may have to be resolved within a time frame or used in rank ordering issues. Project
planning and control can be facilitated by integrity constraints on its primitives. An
example of such a constraint could state that no requirement can be elaborated or
refined until all requirements with higher priority or earlier validity time are considered.

•  accountability: The REMAP environment facilitates the automatic capture or the
representation of accountability information associated with design rationale.

•  Links to all system components: The REMAP model can be used to capture
relationships between requirements and all system components, including humanware,
hardware, software, etc.

•  automated support: REMAP provides automated support for different stakeholders
including interactive querying and updating of the design rationale knowledge base,
a client-server architecture for multi-user support, a textual as well as hypertext-like
user interface to the knowledge base and a reason maintenance system for maintaining
and reasoning with design rationale.

•  derived links: REMAP provides facilities for inferring knowledge based on deductive
rules and facilitates the derivation of implicit links between requirements and design
artifacts. For instance, a rule could state that if a design element is created by a
decision, and the decision resolves an issue and the issue was generated by a requirement,
then the design element traces to the requirement.
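
The derived-link rule in the last item can be illustrated with a short sketch. This is not REMAP's rule language; it is a plain Python rendering of the one example rule, and the relation names (creates, resolves, generates) and the identifiers in the sample data are assumptions made up for illustration.

```python
def derive_traces(creates, resolves, generates):
    """Apply the example rule: element created by a decision, decision resolves
    an issue, issue generated by a requirement => element traces to requirement.

    creates:   set of (decision, design_element) pairs   (hypothetical relation)
    resolves:  set of (decision, issue) pairs            (hypothetical relation)
    generates: set of (requirement, issue) pairs         (hypothetical relation)
    """
    traces = set()
    for decision, element in creates:
        for resolving_decision, issue in resolves:
            if resolving_decision != decision:
                continue
            for requirement, generated_issue in generates:
                if generated_issue == issue:
                    traces.add((element, requirement))
    return traces


# Made-up sample data: decision D1 creates one element and resolves issue I1.
creates = {("D1", "cache_module")}
resolves = {("D1", "I1")}
generates = {("R7", "I1"), ("R8", "I2")}

derived = derive_traces(creates, resolves, generates)
print(derived)
```

Only the requirement whose issue was resolved by the creating decision is linked; R8 generated a different issue and so yields no derived link.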

6  Conclusions 

The preliminary analysis of the results of our initial study suggests that comprehensive models
of traceability need to be developed. An approach to developing such models is to
understand the traceability needs of various stakeholders in the systems development process.
Further, a model of traceability should represent and reason with the semantics of various
traceability relationships in supporting system development and maintenance activities. Our
current work has investigated the use of the REMAP design rationale model and the reasoning
mechanisms supported by it as an example of such an approach. Development of similar
models and mechanisms to cover other important aspects of traceability is being addressed
in ongoing research.


ABSTRACT

Specification of top level system requirements is the critical first step in the development of large
complex computer-based systems. These requirements must be captured in a clear and
unambiguous form to avoid unintended interpretations by design personnel and to permit
expeditious pursuit of system development. This paper addresses the opportunity for application
of embedded expert system technology in the development of a natural language interface which is
tailored to support implementation of a semi-automated environment for system requirements
generation and capture.


INTRODUCTION 

Specification of top level system requirements is the critical first step in the development of large
complex computer-based systems. These requirements must be captured in as clear and
unambiguous a form as possible to avoid unintended interpretations by design personnel and to
permit expeditious and direct pursuit of system development. Top level requirements for large
complex systems must also be internally consistent and complete to support efficient development
of system design and to avoid false starts due to incorrect or vague requirements. Top level system
requirements are often subject to broad interpretation, particularly if they do not include quantitative
measures. English language statements of required functionality can be inherently ambiguous and
may contain multiple meanings which are context sensitive.

The development of top level system requirements typically includes a broad spectrum of
performance, functional, operational and other requirements established by domain experts who
understand the customer's needs and constraints. These domain experts often have considerable
experience in the development and/or operation of similar systems but may not have formal training
or experience in the use of formal or semi-formal specification techniques (i.e., essentially all top
level specifications are written in English).

Natural language interface techniques potentially provide a mechanism for these domain experts to
interactively employ a semi-automated tool and a structured method which supports system
requirements generation and capture using common English and which would require little or no
training. Expert system methods represent a powerful approach to the realization of a natural
interface for requirements capture and analysis. This paper addresses the opportunity for
application of embedded expert system technology in the development of a natural language
interface which is tailored to support implementation of a semi-automated environment for system
requirements generation and capture. The interactive natural language interface is intended to act as
a guide to domain expert users in the development of formal, consistent, unambiguous requirements
for large complex systems.


NATURAL INTERFACE OBJECTIVES


One of the principal objectives for any natural interface is to reduce the effort expended by an
operator in learning and using a given operator-machine interface. In addition, this proposed
natural interface for top level requirements capture will provide a mechanism for informal
unstructured English text to be converted to a set of formal structured requirements through an
interactive session with an expert system conducted in English. The formal structured requirements
captured using this approach can also be converted to an equivalent statement of requirements in
English.

The  payoff  in  implementation  of  an  effective  natural  interface  is  greater  where  the  user's  time  is 
highly  valued  (e.g.,  senior  executives)  and  where  familiarity  with  machine  oriented  interfaces  is 
low  (e.g.,  many  domain  specific  experts).  Domain  experts  who  typically  create  these  top  level 
system  requirements  must  also  be  the  ones  who  refine  them,  or  the  requirements  might  be 
misinterpreted  in  the  refinement  process.  An  interactive  natural  interface  to  a  formal  structured 
requirements  capture  method  would  provide  both  ease  of  use  for  the  user  and  a  well  structured, 
clear  requirements  specification  product. 

The  natural  interface  concept  described  in  this  paper  supports  generation  and  capture  of  a  clear, 
unambiguous,  complete  statement  of  the  top  level  requirements  and  design  constraints  for  a 
complex system. The interface is intended to provide uninitiated users who are subject matter
experts  a  mechanism  for  developing  and  capturing  effective  complex  system  top  level 
requirements.  The  natural  interface  methodology  should  accommodate  different  perspectives  of  the 
system  or  the  system  requirements  which  may  be  held  by  various  domain  experts  in  specific  areas 
(i.e.,  the  various  engineering  specialties). 


NATURAL  INTERFACE  AND  EXPERT  SYSTEM  EMPLOYMENT  STRATEGY 

The symbolic processing technique represented by a forward-chaining inference engine is a natural
match to implementation of a natural structured interactive English interface and to the pursuit of
generating and refining a complete unambiguous set of top level system requirements. The
essential strategy for implementing an expert system-based natural interface for requirements
capture is to combine a high performance graphic interactive operator machine interface with an
innovative expert system designed to conduct an interactive dialogue with the user through natural
English language constructs.

The expert system creates a dialogue with the user, implementing an initial rule set based on the
English language, the formal structure, and information specific to the project’s domain.
Questions of meaning are resolved by the user, and the expert system “learns” based on the user's
responses to its questions. For example, if a group of words is categorized as
domain-specific “jargon” in the first paragraph of a document, that jargon is recognized throughout
the rest of the document. The ability of the expert system to “learn” and “remember” supports the
capability of the interface to parse through a document without using complicated language
recognition techniques.
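
The "learn and remember" behavior can be pictured with a minimal sketch. It is not the prototype's CLIPS rule set: the function name, the callback standing in for the interactive question, the tiny vocabulary, and the sample sentences are all assumptions invented for illustration, and it handles single words only rather than groups of words.

```python
def scan(sentences, vocabulary, ask):
    """Scan sentences in order and build a dictionary of learned terms.

    ask(term) stands in for the question put to the user; it returns a
    category string and is called at most once per unknown term.
    """
    learned = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            word = word.strip(".,;:()")
            if not word or word in vocabulary or word in learned:
                continue  # already understood, or already explained by the user
            learned[word] = ask(word)
    return learned


questions = []  # records what the user was actually asked


def ask(term):
    questions.append(term)
    return "jargon"  # a real session would return the user's answer


# Made-up vocabulary and requirement sentences.
vocabulary = {"the", "shall", "support", "and", "use"}
sentences = [
    "The radio shall support DAMA.",
    "The radio shall use DAMA and SATCOM.",
]
learned = scan(sentences, vocabulary, ask)
print(questions)
```

Terms explained while reading the first sentence are not asked about again in the second, which is the effect the paragraph above describes.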

The expert system provides a mechanism for natural interactive exchange with the operator, and
guides the capture of requirements to ensure consistency, avoid ambiguity, and help achieve
completeness based on a formal representation. To help the user better understand the formalized
requirement, a structured English version is created. This Equivalent Statement of Requirements
(ESOR) is a limited English explanation of the formal requirements. The ESOR allows domain
experts to understand and examine the formal requirements and ensure that the converted
requirements are consistent with their intentions.


PROTOTYPE SYSTEM REQUIREMENTS CAPTURE TEST BED


A rapid prototype test bed for examining the application of expert system technology to top level
system requirements capture and analysis has been created using UNIX and the X Window
System on a Sun SPARCstation 2, and is written in the C programming language. The expert
system employed is CLIPS (NASA sponsored through COSMIC) which is embedded in the tool
and fully integrated within a UNIX process. A preliminary set of rules has been developed and
demonstrated which addresses the three key capabilities of the prototype system: (1) parse English
language requirements statements, (2) generate a formal structured statement of the requirement in
the form of an Information Model, and (3) ask the domain expert user questions to support correct
parsing of the requirements and to provide additional data to complete the Information Model. The
user interacts with the expert system by responding to questions and asserting additional
information. The expert system leads the user through a process designed to generate consistent
unambiguous requirements statements through natural English interaction with the user.

A  unique  implementation  of  CLIPS  has  been  developed  as  part  of  the  prototype  test  bed.  This 
implementation introduces the concept of “metarules” as they pertain to natural English language
requirements  parsing  and  creating  a  structured  formal  representation  of  these  requirements.  This 
approach  is  discussed  below. 


ENGLISH  LANGUAGE  REQUIREMENTS  PARSING  AND  A  METARULE 
APPROACH 

Parsing English language requirements and creating a structured formal representation of those
requirements (such as an Information Model) through an interactive session between operator and
machine is accomplished using a forward chaining expert system approach. The objective is not
to extract the semantic meaning of an English sentence, which is a much larger and potentially
overwhelming task, but to guide the human operator through a process of requirements refinement
and capture. The natural interface proposed here to support system requirements capture focuses
on a carefully chosen subset of English grammar and vocabulary which is common to the majority
of top level requirements. The selected subset is tailored to provide a robust English language
communication capability and to limit the processing required to parse and understand the user
input.

The  embedded  expert  system  is  constructed  in  a  way  to  maximize  the  flexibility  of  the  knowledge 
base  through  the  use  of  an  innovative  metarule  approach.  The  metarule  approach  employed  is 
described  in  the  following  paragraphs  which  begin  by  describing  the  typical  operation  of  a  forward 
chaining  inference  engine  and  then  contrast  that  operation  with  the  metarule  concept.  The 
advantages  and  disadvantages  of  the  metarule  approach  are  also  addressed. 

An expert system is typically composed of an inference engine and a knowledge base. Within the
knowledge base are one or more sets of rules and facts. Typically the rule base is fixed for a given
execution of the inference engine (although different subsets of rules may be employed at various
times) and the facts change as a function of external inputs and the operation of the inference
engine. In the case of a forward chaining inference engine the rules are constructed in an “if a then
b” format where a is the antecedent fact and b represents a consequent fact. (Both a and b can
represent complex boolean expressions of antecedent facts or consequent facts respectively.) The
inference engine looks for matches in the current facts with the antecedents of the rules. When a
current fact matches an antecedent of a particular rule, the rule is said to fire by essentially asserting
the facts in the consequent side of the rule. The new facts are then addressed in the same manner
by the inference engine until no additional facts are generated.
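
The match-fire-repeat cycle just described can be sketched in a few lines. This is a generic illustration in Python, not CLIPS and not the prototype's code; it assumes antecedents are simple conjunctions of facts (no general boolean expressions) and uses placeholder fact names.

```python
def forward_chain(rules, facts):
    """Fire rules until no new facts are generated.

    rules: list of (antecedents, consequents) pairs, each a set of facts.
    facts: iterable of initial facts.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedents, consequents in rules:
            # A rule fires when every antecedent is a known fact and it
            # still has something new to assert.
            if antecedents <= known and not consequents <= known:
                known |= consequents
                changed = True
    return known


# Placeholder rules: "if a then b", "if b and c then d".
rules = [
    ({"a"}, {"b"}),
    ({"b", "c"}, {"d"}),
]
result = forward_chain(rules, {"a", "c"})
print(sorted(result))
```

The second rule cannot fire until the first has asserted "b", which shows how consequents of one rule become antecedents of another.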

The metarule approach to knowledge base creation and maintenance as applied in this research
employs  rules  and  facts  in  a  somewhat  different  manner.  In  this  unique  implementation  of  CLIPS, 
domain-specific  rules  are  pre-processed  and  represented  in  CLIPS  as  facts.  The  CLIPS  rules  that 
exist  in  the  knowledge  base  (referred  to  as  the  metarules),  consist  of  four  format-dependent  rules 
that  operate  on  all  of  the  user-defined,  domain-specific  rules  (see  Figure  1).  Using  this  approach, 
the  expert  system  partition  of  this  system  is  designed  to  efficiently  process  those  aspects  of  a 
natural  interface  language  parsing  to  support  system  requirements  capture. 


Figure  1:  CLIPS  Metarule  Approach 


Metarules are designed to operate with depth-first conflict resolution and forward inferencing. The
term "metarule" refers to any CLIPS rule which is used to process another rule. Metarules can
only operate on rules that exist in a predefined form. The metarules “Detect a Valid Antecedent
Clause”, “Conjunctive Antecedent Clause”, “Execute User Rule”, and “Mutually Exclusive
Parameters” are supported by:

•  <valid clause>: formats used by the metarules

•  ManageRule function: communication from C++ to CLIPS

•  <result>: communication from CLIPS to C++

Information  is  represented  in  the  CLIPS-based  expert  system  through  facts,  objects,  and  global 
variables.  The  user  has  access  (modification,  authoring,  examination)  to  the  sets  of  “if-then”  rules 
and  their  associated  facts.  During  run  time,  facts/rules  and  metarules  are  maintained  in  the 
knowledge  base,  which  is  processed  by  the  inference  engine.  Although  the  knowledge  base  can  be 
accessed  in  a  number  of  ways,  it  is  processed  only  by  the  inference  engine.  The  result  of  this 
processing  is  modification  of  the  knowledge  base  and  creation  of  additional  facts. 

To emphasize, all user-defined rules are internally stored as pro forma “facts” in the knowledge
base. The metarules are able to identify these facts and to process them as logical decisions of the
sample type:


Whenever  <antecedent>  then  <consequent>. 


Where: 

•  <antecedent> represents one or more preconditions

•  <consequent> represents one or more actions which result if all antecedents are
satisfied. Consequents can act as antecedents in other rules (i.e.,
rules can be chained).
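
The idea of storing user rules as facts and interpreting them with a small fixed set of rules can be sketched as follows. This is a loose Python analogue, not the four CLIPS metarules named above; the tuple layout of a rule fact, the function names, and the placeholder facts are assumptions made for illustration.

```python
def is_user_rule(fact):
    # A user rule is an ordinary fact with a predefined form (assumed here):
    # ("rule", name, (antecedent, ...), (consequent, ...))
    return isinstance(fact, tuple) and len(fact) == 4 and fact[0] == "rule"


def antecedents_hold(rule, facts):
    # Conjunctive antecedent: every clause must already be a fact.
    return all(clause in facts for clause in rule[2])


def execute_user_rule(rule, facts):
    # Assert the consequents; report whether anything new was added.
    new_facts = set(rule[3]) - facts
    facts.update(new_facts)
    return bool(new_facts)


def run(knowledge_base):
    """Fixed interpreter loop: the only 'rules' are the three functions above."""
    facts = set(knowledge_base)
    changed = True
    while changed:
        changed = False
        for rule in [f for f in facts if is_user_rule(f)]:
            if antecedents_hold(rule, facts) and execute_user_rule(rule, facts):
                changed = True
    return facts


# Placeholder knowledge base: one plain fact and three user rules stored as facts.
kb = {
    "A",
    ("rule", "r1", ("A",), ("B",)),
    ("rule", "r2", ("A", "B"), ("C",)),
    ("rule", "r3", ("D",), ("E",)),
}
final = run(kb)
```

Because user rules are data, adding or changing one means adding or changing a fact; the interpreter itself stays the same size however many user rules exist.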

The advantages of this metarule approach include processing speed and simplicity of user
implementation. Benchmark tests using CLIPS as an embedded system indicate a dramatic
increase in processing time with respect to the number of rules that the inference engine processes.
Tests of number of rules versus number of facts showed conclusively that a small number of
rules operating on a large number of facts is the optimal utilization of CLIPS. Simplicity of user
implementation is another key advantage. From the user’s perspective, the rules will be seen as
English-like sentences and will be constructed primarily by responding to prompts from the expert system.
The technique minimizes natural language processing, provides the flexibility of adapting to the
user’s lexicon and, most importantly, does not require the user to “program” rules in any particular
order.

The potential shortcoming of the proposed metarule approach is the possibility that processing
speed may still be slow. As CLIPS is an interpreted language (as opposed to compiled),
execution speed is affected, even with the time-saving advantages given by the metarule approach.
However, when the expert system partition of this system is used as designed (to efficiently process
those aspects of natural interface language parsing that support system requirements capture),
performance tends to be bound more by the graphic interface than by anything else, and speed is
not anticipated to present an unsolvable problem.


PRELIMINARY DEMONSTRATION - EXAMPLE UHF RADIO REQUIREMENTS
ANALYSIS

The  following  is  a  short  example  requirement  which  was  processed  using  the  prototype  system. 
The  example  top  level  requirement  is  part  of  an  actual  unclassified  UHF  radio  system  Tentative 
Operational Requirement (TOR). The session begins by loading the raw (i.e., not analyzed)
requirements  document  and  initiating  the  expert  system.  The  expert  system  then  leads  the  user 
through  an  analysis  of  the  requirements  in  a  step  wise  fashion  and  builds  a  fact  base  and  an 
Information  Model  as  a  result  of  the  interaction  with  the  user.  As  the  fact  base  grows  through 
interaction  with  the  user,  it  is  used  by  the  expert  system  to  maintain  an  internally  consistent  set  of 
requirements  and  terms  of  reference.  During  the  process  the  expert  system  constructs  a  formal 
structured  Information  Model  (entity-relationship  diagram)  which  captures  the  requirements  as 
refined  through  the  interaction  and  also  develops  an  equivalent  statement  of  requirements  which  is 
a structured English language representation of the Information Model. Figure 2 illustrates the
original  requirement  as  input  into  the  system  as  well  as  the  Information  Model  and  ESOR  generated 
based  upon  an  interactive  session  with  the  user. 

Figure 2: Prototype System Output


FUTURE  RESEARCH 

Continuation  of  this  work  will  begin  with  identifying  one  or  more  formal  structured  requirements 
capture  techniques  which  can  serve  as  test  cases  for  implementation  of  a  full  scale  natural  language 
interface  capability.  Research  in  natural  language  requirements  parsing  and  development  of  rule 
based  strategies  for  interactive  creation  of  structured  top  level  requirements  will  be  influenced  in 
part  by  the  format  and  characteristics  of  the  selected  formal  requirements  structure.  Additional 
aspects of the future work will include characterizing the English grammar and usage nominally
associated  with  requirements  specification,  and  development  of  graphic  interactive  techniques  to 
enhance  user  interaction  with  both  the  expert  system  and  the  captured  requirements. 


INTEGRATED SYSTEM DESIGNER (ISD)


Man-  Blanchard 

Science and Technology Associates, Inc.
Suite 700, 4001 North Fairfax Drive
Arlington, Virginia 22203


What  information  is  necessary  to 
represent  complete  system  designs? 
How  should  this  information  be  repre¬ 
sented  to  support  evolving  designs? 

Can this representation support reusability?
What format would enable all
the engineers, end-users, and customers
to view the system through the “same
eyes”? How can the engineers ensure
the proposed design fulfills the defined
requirements?

While examining each of these
questions,  we  explored  the  currently 
available  CASE  and  CBSE  tools.  Each 
of  today’s  tools  provides  a  partial 
solution;  what  is  lacking  is  a  single 
representation  addressing  all  the  above 
questions.  Our  response  is  The  Inte¬ 
grated  System  Designer  (ISD)  —  a 
single method of representing systems
which supports all the phases of the product
life cycle: domain analysis, system
design,  trade-off  analysis,  development, 
and  maintenance.  The  proposed  solu¬ 
tion  supports  an  iterative  approach  to 
the  previously  outlined  tasks.  Dia¬ 
grams  1-3  summarize  ISD  and  its  rela¬ 
tionship  to  existing  tools. 


Returning to figure 1, you will
note many sources of both input and
output. The seven system default libraries
are one example. Each of these
libraries will be predefined and include
classes encapsulated with both attributes
and services, from which users
can create their own system representation.
ISD will provide support for
updating and expanding these libraries.
Their existence serves two purposes:
one, the user is able to quickly create
designs, and two, the system has a
systematic method for acquiring the
details of the system design. This aids
in providing both additional supporting
documentation and a consistent set of
input data to the analysis and simulation
tools.

Actually this information is
passed to the Automated Model Builder
before any analysis or simulation. The
Automated Model Builder is responsible
for synthesizing all the information
contained within the system design
and creating an equivalent model that
can be understood by the analysis and
simulation tools.

Figure 3


The other visible sources of
input and output are the domain specific
component libraries. Their usage
is primarily for constructing the domain
specific component of a new
system, using components from past
systems  and  applications.  These  com¬ 
ponents  may  range  from  complete 
applications  to  individual  low-level 
classes.  It  is  up  to  the  user  to  select  the 
appropriate  level  of  complexity  for 
reuse. 

The components contained in
figure 3 represent the resulting system
design. As you will notice, the design is
broken  into  five  parts  —  domain  spe¬ 
cific,  human  interface,  data  manage¬ 
ment,  hardware,  and  task.  How  a  sys¬ 
tem  is  represented  using  this  technique 
is  explained  in  the  next  set  of  diagrams 
4-8  and  in  the  subsequent  paragraphs. 

Contained  within  the  Domain 
Specific  Component  will  be  the  people, 
physical  entities,  and  abstract  entities 
that  exist  within  your  domain.  The 
Human  Interface  Component  will  rep¬ 
resent  those  entities  and  operations  that 
constitute the user interface. Traditionally,
these are menus, menu items,
graphic  icons,  windows,  and  panes. 
However,  your  system  may  use  a  ther¬ 
mostat  as  an  input  device  —  whatever 
entities  and  operations  your  human 
interface  consists  of,  represent  it  here. 

Your data layout, which includes
tables, files, and databases, is represented
in the Data Management Component.
Each class and, correspondingly,
each object will have an associated
set of attributes defining the layout
and  location  of  the  data.  The  exact 
format  will  vary  depending  upon  the 
type  of  data  management  scheme  se¬ 
lected  for  your  system  design.  Your 
design  may  also  involve  more  than  one 
data  management  scheme.  Regardless 
of  the  scheme  chosen,  system  default 
classes,  with  encapsulated  attributes 
and  services,  will  exist  for  you  to  use  in 
building  your  layout.  Associated  with 
each  data  management  scheme,  will  be 
an  “object-server”  class,  with  the  ser¬ 
vices  “store”  and  “retrieve”.  In  actual¬ 
ity,  these  two  services  will  contain  calls 
to  other  store  and  retrieve  services,  to 
account  for  each  of  the  possible  loca¬ 
tions  the  data  could  be  stored.  This 
enables  the  analyzer  and  simulator  to 
model  storage  locations  such  as  cache 
and  RAM,  besides  a  secondary  storage 
device. 
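
A minimal sketch of the "object-server" idea follows. It is not ISD's class library: the class names, the access-time attribute, its cost units, and the sample locations are assumptions chosen only to show how "store" and "retrieve" can delegate to per-location services so that a simulator could account for where data resides.

```python
class StorageLocation:
    """One place data can live (e.g. cache, RAM, secondary storage)."""

    def __init__(self, name, access_time):
        self.name = name
        self.access_time = access_time  # hypothetical cost units
        self._data = {}

    def store(self, key, value):
        self._data[key] = value

    def retrieve(self, key):
        return self._data[key]

    def holds(self, key):
        return key in self._data


class ObjectServer:
    """'store' and 'retrieve' services that call the per-location services."""

    def __init__(self, locations):
        self.locations = locations  # ordered fastest first
        self.elapsed = 0            # accumulated access cost

    def store(self, key, value):
        # Assumed policy: write through to every location.
        for location in self.locations:
            location.store(key, value)
            self.elapsed += location.access_time

    def retrieve(self, key):
        # Try each location in order, paying for every lookup attempted.
        for location in self.locations:
            self.elapsed += location.access_time
            if location.holds(key):
                return location.retrieve(key)
        raise KeyError(key)


cache = StorageLocation("cache", 1)
ram = StorageLocation("RAM", 10)
disk = StorageLocation("disk", 1000)
server = ObjectServer([cache, ram, disk])

disk.store("track_table", [1, 2, 3])    # data initially only on disk
value = server.retrieve("track_table")  # misses cache and RAM, then hits disk
```

Keeping the location-specific calls behind one server class lets the accumulated cost differ by where the data is found without changing the callers.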

Store all hardware devices, including
CPUs, sensors, actuators, operating
systems, networking components,
pumps, secondary storage devices, and
any other physical devices within the
Hardware Component. System default
components  with  associated  attributes 
and  services  will  be  available  to  the 
user  to  quickly  assemble  the  design  of 
the  hardware  system. 

All  the  services  defined  within 
the  Domain  Specific  Component,  Hu¬ 
man  Interface  Component,  and  Data 
Management  Component  will  be 
mapped  to  both  hardware  devices  con¬ 
tained  within  the  Hardware  Compo¬ 
nent,  and  human  resources  contained 
within  the  Domain  Specific  Compo¬ 
nent,  in  the  Task  Component. 

The  justification  for  this  repre¬ 
sentation  is  based  upon  the  history  of 
the  system  design  process.  Historically, 
components such as the user interface
have been very volatile, while components
related to the specific domain
have  been  more  stable.  Using  this 
knowledge  we  have  split  the  design  into 
its  respective  volatile  and  stable  com¬ 
ponents.  Since  system  performance  is 
one  of  our  concerns,  it  is  critical  to 
isolate  the  hardware  component  from 
other  system  components.  This  enables 
the  user  to  more  easily  test  various 
hardware  configurations,  without  sig¬ 
nificantly  impacting  the  other  compo¬ 
nents  of  the  design. 

Diagram 4 illustrates the hierarchical
decomposition of a system
from a top-down approach. ISD also
provides  support  for  both  bottom-up 
and  integrated  approaches. 

In diagram 5, the decomposition
of the Domain Specific Component is
demonstrated. All the diagrams shown
here  are  applicable  to  all  five  compo¬ 
nents  —  the  exception  being  the  “Ob¬ 
ject-behavioral  Diagram”.  This  dia¬ 
gram  is  not  applicable  to  the  Task  Com¬ 
ponent,  since  this  information  is  already 
contained within the PDL description
of a task.

Diagrams 6 and 7 contain information
typically associated with a class.
Note that the rectangle containing “analysis”
is not solid, like the others. The
reason is that analysis results typically
provide information corresponding to
the entire design, not an individual
class.

ISD  supports  the  class  relation¬ 
ships  demonstrated  in  diagram  8. 

As will become apparent, our
notation  describes  classes  with  their 
encapsulated  attributes  and  services, 
associations,  and  hierarchical  and  ag¬ 
gregation  relationships.  Supporting 
templates  are  provided  for  each  class, 
attribute,  service  and  relationship.  ISD 
describes  objects  in  terms  of  the  at¬ 
tributes,  services,  and  relationships 
associated with the originating class.
Additionally, objects are described in
terms  of  the  messages  they  send,  in 
response  to  events  and  changes  in  state. 
The  object-behavioral  diagram  provides 
the  additional  supporting  notation 
required  to  describe  an  object’s  reac¬ 
tions  to  events  and  state  changes.  The 
subsequent diagrams 9-14 provide
descriptions of the supporting constructs.

Figure 12


In order to support top-down
composition, it is necessary to support
both class and service decomposition.
Class decomposition is fairly straightforward
and has been practiced for
several years, as has functional decomposition.
What is different is the requirement
for the system design process
to function in an object-oriented
mode first, and then in a functional
mode. This requires a slightly different
process and representation than what
has been traditionally used. As you will
see in the following diagram, the user
begins by defining the appropriate
classes at level n and decomposing
these classes into classes at level n - 1.
Beginning at level n - 1 and working up
the hierarchy is also appropriate. Next,
the user maps the services associated
with level n to an appropriate set of
services at level n - 1 (see figure 15).


Each service is described using
PDL. Before a language choice is
made, language-independent PDL is
used; after a choice is made, the previous
PDL descriptions are mapped to
their corresponding language-dependent
representations. In many cases, the
user will be asked for additional information.
Typically, this additional information
consists of attributes associated
with language-specific components.
Diagram 16 illustrates sample language-dependent
components, language-independent
components, and PDL
constructs.
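
The mapping step described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the
construct names are taken from Figure 16, but the table
ADA_CANDIDATES, the function map_construct, and the pairing of each
language-independent construct with particular Ada candidates are
hypothetical and are not part of the tool described here. The
callback `choose` stands in for asking the user for additional
information.

```python
# Hypothetical lookup table: the construct names come from Figure 16, but
# which language-independent construct maps to which Ada candidates is an
# assumption made for this sketch.
ADA_CANDIDATES = {
    "Loop...EndLoop": ["Loop", "For Loop", "While Loop"],
    "If...Then": ["If...Then"],
    "Mutual exclusion": ["Semaphore", "Rendezvous"],
    "Synchronization": ["Rendezvous", "Semaphore"],
}


def map_construct(construct, choose=None):
    """Map one language-independent construct to an Ada-specific one.

    `choose` stands in for asking the user for additional information.
    It receives the construct and its candidates and returns one candidate.
    """
    candidates = ADA_CANDIDATES[construct]  # KeyError if the construct is unknown
    if len(candidates) == 1:
        # Only one language-specific form exists, so no question is needed.
        return candidates[0]
    if choose is None:
        raise ValueError(f"additional information needed for {construct!r}")
    choice = choose(construct, candidates)
    if choice not in candidates:
        raise ValueError(f"{choice!r} is not a candidate for {construct!r}")
    return choice


if __name__ == "__main__":
    print(map_construct("If...Then"))
    print(map_construct("Loop...EndLoop", choose=lambda c, options: options[2]))
```

A construct with a single candidate maps directly; one with several
candidates cannot be mapped until the extra information is supplied,
which mirrors the point at which the user is asked for more detail.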


System-defined Language Independent Specific Classes
    Mutual exclusion
    Synchronization

Language-Independent PDL
    Loop...EndLoop
    If...Then
    If...Then...Else
    Begin...End
    Case...EndCase

System-defined language specific classes
    Package
    Rendezvous
    Semaphore
    Shared memory...

Language Specific PDL (for Ada)
    Loop
    For Loop
    While Loop
    If...Then
    Select [conditional]...

Figure 16

The Automated Model Builder
uses the resulting system design in
conjunction with the Building Blocks
Table to produce the model for analysis
and simulation. Understanding the
model’s representation is necessary to
understand how the model is created.
Figures 17 - 19 contain a sample of a
model ready for analysis and simulation.

The model appears to be a long series of
if-statements, which it is. However,
associated with one of the if-statements
is an “event”. This “event” translates to
a transition within the analysis or simulation
tool, and to a
service on the system description side
of the tool. Also, on this diagram you
will see probabilities. These represent
the frequency of choosing one path
over another in the system design.

Other data includes the amount of time
required to execute the service or service
statement corresponding to the
“event”. The execution time is assumed
to be for a non-preemptive
scenario. Preemption is modeled
through additional constructs created by
the automated model builder and by the
“queuing up of data” at a resource.
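
One way to picture this representation is as a rule record that
carries its conditions, results, event, probability, and execution
time. The Python sketch below is an assumption-laden illustration,
not the tool's internal format: the class Rule and the functions
enabled and fire are hypothetical names, and the firing semantics
(the "If" conditions are consumed and the "Then" results are
produced, as in a Petri net transition) is an assumed reading of the
figures. The example rule is the "Unless" rule shown in Figure 17.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Rule:
    conditions: FrozenSet[str]             # the "If" part
    results: FrozenSet[str]                # the "Then" part
    unless: FrozenSet[str] = frozenset()   # the "Unless" part, if any
    event: Optional[str] = None            # service tied to this rule
    probability: float = 1.0               # frequency of taking this path
    time: float = 0.0                      # non-preemptive execution time


def enabled(rule, state):
    """A rule may fire when every condition holds and no 'unless' item holds."""
    return rule.conditions <= state and not (rule.unless & state)


def fire(rule, state):
    """Assumed semantics: consume the conditions, then add the results."""
    if not enabled(rule, state):
        raise ValueError("rule is not enabled in this state")
    return (state - rule.conditions) | rule.results


# The priority 1 selection rule from Figure 17.
SELECT_PRIORITY_1 = Rule(
    conditions=frozenset({"CPU SDP available", "ready queue, priority 1"}),
    results=frozenset({"selected ready queue, priority 1"}),
    unless=frozenset({"ready queue, priority 0"}),
)
```

Under these assumptions the priority 1 queue is selected only while
no priority 0 entry is waiting, which is how the static priorities
show up in the model.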

The system produces models by
first examining each of the resources
contained within a system design, the
operating system (if applicable) executing
on each of these resources, and the
tasks executing on each of these resources.
Additionally, information
about each of these components, as
well as PDL descriptions of tasks and
services, is used to locate and complete
corresponding constructs within the
Building Blocks Table (see figures 20 -
23). The Automated Model Builder
attaches each newly created construct
or set of constructs to the “model under
construction”. These steps are repeated
until the list of resources within the
system design is exhausted. At this
point, the model is complete and ready
for analysis and simulation.
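
The loop described above can be sketched as follows. This is a
minimal illustration under stated assumptions, not the Automated
Model Builder itself: the dictionary BUILDING_BLOCKS, the functions
complete and build_model, and the layout of the resource records are
hypothetical, and the two template strings only paraphrase the
Figure 22 construct.

```python
# Hypothetical Building Blocks Table with one entry. The template text
# paraphrases Figure 22; placeholders are written as <name>.
BUILDING_BLOCKS = {
    "Selection of Task with Static Priorities": [
        "If Task <task name>, priority <n> / selected ready queue, "
        "priority <n> Then <OS>",
        "If <OS> Then Task <task name>, priority <n> START / Task running",
    ],
}


def complete(template, values):
    """Fill every <placeholder> in a template with design information."""
    for key, value in values.items():
        template = template.replace(f"<{key}>", str(value))
    return template


def build_model(resources):
    """Attach completed constructs until the resource list is exhausted."""
    model = []                    # the "model under construction"
    pending = list(resources)
    while pending:
        resource = pending.pop(0)
        for task in resource["tasks"]:
            values = {
                "task name": task["name"],
                "n": task["priority"],
                "OS": resource["os"],
            }
            # Locate the construct in the table, complete it, and attach it.
            for template in BUILDING_BLOCKS[task["construct"]]:
                model.append(complete(template, values))
    return model


if __name__ == "__main__":
    design = [{
        "os": "operating system-SDP",
        "tasks": [{
            "name": "SDP-clock",
            "priority": 0,
            "construct": "Selection of Task with Static Priorities",
        }],
    }]
    for rule in build_model(design):
        print(rule)
```

A real table would hold one entry per construct listed in Figure 20
and would also carry the event and timing data attached to each
construct.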

Figure 24 documents the results
of the analysis and simulation processes.

Work  on  both  the  Building 
Blocks  Table  and  the  system  analysis/ 
design  representation  is  on-going.  The 
focus is currently on testing the analysis/design
representation on as many
different  types  of  systems  as  possible. 
Future  work  includes  building  up  the 
“Building  Blocks  Table”  to  support 
many different designs.

If
    event physical device clock-Cl.interrupt
    suspended queue, priority 0
Then
    ready queue, priority 0

If
    CPU SDP available
    ready queue, priority 0
Then
    selected ready queue, priority 0

If
    CPU SDP available
    ready queue, priority 1
Then
    selected ready queue, priority 1
Unless
    ready queue, priority 0

If
    task task SDP-clock, priority 0
    selected ready queue, priority 0
Then
    operating system-SDP

If
    operating system-SDP
Then
    task task SDP-clock, priority 0 START
    task running

Figure 17


If
    event physical device clock-Cl.interrupt
    suspended queue, priority 0
Then
    ready queue, priority 0

If
    CPU SDP available
    ready queue, priority 0
Then
    selected ready queue, priority 0

If
    CPU SDP available
    ready queue, priority 1
Then
    selected ready queue, priority 1
Unless
    ready queue, priority 0

If
    task task SDP-clock, priority 0
    selected ready queue, priority 0
Then
    operating system-SDP

If
    operating system-SDP
Then
    task task SDP-clock, priority 0 START
    task running

Figure 18


If
    task SDP-clock.if clock-Cl.compare_ticksize_interrupt count = true
    task running
Then
    task SDP-clock idle
    task running
    CPU SDP available
    suspended queue, priority 0

Probability associated with first condition of If-statement: 0.8

If
    task SDP-clock.if clock-Cl.compare_ticksize_interrupt count = true
    task running
Then
    wait 3
    task SDP-clock.clock-Cl.send signal to calendar requested
    task running

Probability associated with first condition of If-statement: 0.2

If
    wait 3
    task SDP-clock.clock-Cl.send signal to calendar completed
    task running
Then
    CPU SDP available
    suspended queue, priority 0


Figure 19


Construct

    Clock-Driven Tasks
    Internal Event-Driven Tasks
    External Event-Driven Tasks
    Multitasking with Static Priorities
    Selection of Ready Queue with Static Priorities
    Selection of Task with Static Priorities
    If-Then

Formalism

    see External Event-Driven Tasks
    see Selection of Ready Queue with Static Priorities
    see Selection of Task with Static Priorities


Figure 20


Selection of Task with Static Priorities

If
    Task <task name>, priority <n>
    selected ready queue, priority <n>
Then
    <OS>

If
    <OS>
Then
    Task <task name>, priority <n> START
    Task running

Event: <OS> process scheduler

Time associated with event: <OS>.process scheduling <service time>

Figure 22

Results:


Performance  Analysis 

System throughputs (task and service throughputs)
System  resource  utilizations  (hardware  and  software) 
Average  queue  lengths  of  system  resources 
Occurrences  of  system  deadlock 
Livelock 

Safety  Analysis 

Cost  Analysis 

Reliability,  Maintainability,  and  Supportability  Analysis 
Fault  Tolerance  Analysis 


Figure 24

ABSTRACT

As both Systems Engineering and Software
Engineering mature, care must be taken to ensure that the
interface between the two disciplines supports the passage
of information as “smoothly” as possible (i.e., is neither
labor intensive nor error prone). Several current problems
are identified, and a solution to the “Babel of Notations”
problem is proposed.

1. INTRODUCTION

In his fantasy series about the mythical magical world
of Xanth, Piers Anthony describes a chasm full of dragons
which divides Xanth. The chasm has a “forget spell”
attached to it, so that one forgets it is there unless within
50 yards of it. Characters in his novels are continually
making plans, and then suddenly discovering (actually,
rediscovering) the chasm as they near it, and having to battle
dragons to get across. Of course, after they have finally
crossed it, as they depart they forget that it exists.

This is a good analogy for the chasm that separates
systems from software engineering: a chasm with a
“forget spell”. Software engineering literature is full of
discussions of how to do requirements and designs of real
time software residing in embedded systems, but it is only
when software engineers start to work on a real project that they
rediscover that they must get their requirements from the
systems engineers. Similarly, systems engineers perform
their front end studies, with full knowledge that software
will have to be developed; but it is only WHEN they get
ready to turn over requirements to the software engineers
that they rediscover that WHAT they are ready to turn over
does not meet the perceived needs of the software
engineers.

During the past 30 years, the complexity of systems
being implemented has by any measure increased by
several orders of magnitude (e.g., size of code, size of
memory, number of instructions per second). During that
time, the interface between the systems and software
engineers has suffered significantly. With the advent of
automated tools for both the systems and software
engineers, new problems have prevented the desired
“seamless” transit between tool sets, and it is becoming
imperative that this interface be smoothed out. Unless this
problem is solved, the interface will remain both labor
intensive and error prone, and will thwart efforts to
improve both the productivity of the development process
and the reliability of the end product operational
system.

The purpose of this paper is to trace the development
of this interface from its inception in the 1960s, identify
the current issues, and propose an approach to one of its
thorniest problems: the mapping of systems engineering
notations onto software engineering notations.

2.  BACKGROUND 

The discipline of Systems Engineering gained
prominence in the late 1950s, because it was viewed as a
solution to the problems associated with the development
of systems of high complexity with engineers from
multiple disciplines. It was so successful that in the mid
1960s its use became mandated by the Department of
Defense on all military systems, and all military
contractors sponsored training courses for their front end
engineers to gain a working knowledge of its concepts.

In 1968, the term “software engineering” was coined,
and concepts of “programming” and “software
development” were matured into those of “engineering the
software”. An attempt was made by systems engineers to
treat software as “just another component”. Systems
engineers  allocated  functions  to  software  components,  and 
specifications  were  written  to  document  those  allocations; 
then  software  developers  developed  software  to  satisfy  the 
requirements  in  the  specifications. 

To  understand  the  issues  related  to  the  interface 
between  systems  and  software  engineering,  it  is  useful  to 
step  back  and  review  both  the  fundamental  culture  of 
systems  engineering,  and  the  way  in  which  the  interface 
was  viewed  as  software  engineering  was  developing.  This 
then  sets  the  stage  for  understanding  the  current  interface 
between  them. 

2.1  Culture  of  Systems  Engineering.  The 
basic  concept  of  systems  engineering  presented  in  the 
1960s  is  rather  simple  and  elegant,  as  illustrated  in  Exhibit 
1.  Systems  engineers  are  responsible  for  translating 
customer  goals,  desires,  and  “requirements”  into  an 
integrated  functional  description  of  the  black  box  behavior 
of  a  system  and  associated  performance.  This  behavior  is 
reviewed with the customer to gain concurrence, and then
these functions and their performance are decomposed and
allocated to components, thus providing a systematic
method of exploring the design space. Each design is
evaluated by component developers for feasibility, cost
effectiveness, schedule and risk, and the process iterated
until an optimized (or at least acceptable) design is found.
In addition, the designs are evaluated by the engineering
specialties (e.g., reliability, availability, logistics, human
interface, training, manufacturability) to ensure that these
aspects of the design are acceptable as well. When there is
consensus on feasibility, acceptability, and cost-effectiveness
of a design by all players (including the
customer), this design becomes the baseline description.

Exhibit 1. Fundamental Concepts of Systems
Engineering


The  component  developers  thus  play  three  different 
roles: 

during the systems analysis and early design phases, the
component developer provides estimates of the feasibility,
cost, schedule, and risk implications of a proposed allocation
of functions and performance to the component, to aid the
systems engineer in defining the system black box
behavior. These responses are usually produced in
working groups, with little formality, but are vital to the
assessment of feasibility of a system design and to support
tradeoffs;

during the end of the system design phase, a system
specification containing the allocation of requirements to a
component is reviewed by the component developer to
ensure agreement with proposed cost, schedule, and risk
assessments; and

during the development phase, the component developer is
responsible for development of the component to satisfy
the requirements. When it is completed, the component
developer assists integration and test.

The mechanism for defining the system behavior
originally promulgated in the 1960s was to use
Functional Flow Block Diagrams (FFBDs). This provided
a hierarchical approach for the definition of the timelines of
function execution. The original application of the
approach was the design of a missile and its launch
timeline, so there was a significant bias towards representation
of sequences and concurrencies of functions. Exhibit 2
presents an example of the notation.


At  this  point,  development  shifts  to  the  design  of 
components  to  satisfy  the  allocated  requirements.  Systems 
Engineers  monitor  the  design  to  ensure  that  the 
requirements  will  be  satisfied,  monitor  the  integration  and 
test activities, and process the hundreds (or even thousands)
of  change  orders  resulting  from  changing  customer 
requirements  and/or  component  designer  feedback. 

The criterion dictating the level of detail of the
functionality allocated to the components is simple: the
black box behavior of the component is described to the
level of detail necessary for the component developers to
complete their design without reference to how other
components are being developed. In particular, all
interactions are to be identified and specified. It is the
Systems Engineer’s job to ensure that if all component
developers satisfy their requirements, then the
components will integrate correctly to satisfy the
customer’s requirements. This approach allows fairly
complex systems to be developed, because component
developers need only satisfy their allocated requirements
and interfaces; this limits the amount of information
needed to develop any component. The components are
usually divided by discipline (e.g., propulsion, structures,
electronics) so that one trained in that discipline can
complete the designs of the components.



Exhibit  2.  Sample  FFBD  -  Hierarchy  of 
Functions 

The  importance  of  the  choice  of  FFBDs  to  the 
interface  between  Systems  and  Software  Engineering  will 
be  discussed  further  below. 

2.2  SE/SWE  Interface  in  the  1960s.  During 
the  1960s,  selection  of  the  computers  for  a  system  was 
driven  by  the  environmental  and  sizing  requirements. 
Reliability  for  the  computer  component  was  dictated  by  the 
reliability  of  the  computer  hardware.  By  today’s  standards, 
the  computers  were  quite  small  in  both  memory  and 
execution  speed,  and  hence  the  complexity  of  the  software 
was limited. Thus the allocation of functions to the
computer was equivalent to the allocation of functions to
the software with sizing constraints.

To get an appreciation for how the interface between
systems and software engineering has evolved, it is
instructive to read the systems engineering textbooks of
the 1960s and early 1970s. For example, in his book The
Management of Systems Engineering, Wilton
Chase devotes an entire chapter to “Computer and Software
Systems Engineering”. Some interesting excerpts:

“Designing a computer program ... requires the
application of pure logic for devising the information
processing routines. ... Because computer
programming is strictly an “intellectual exercise,” its
effective system engineering is the most difficult
aspect of designing a complex man-machine system.”
[CHASE, p. 93].

“The most critical step in a software design and
development effort is the startup requirements
analysis. The miss-allocation of software
requirements occurs when the analysis and definition
is split among several functional organizational areas
making proper technical coordination nearly
impossible. An effective means of avoiding this
problem is to assign the total responsibility for
determining software requirements to a central
systems engineering activity.” [CHASE, p. 100].

Topics covered in this chapter include determination of
computer capacity, definition of functions,
computer/software design tradeoffs, software design
(including flowcharting by the systems engineer), coding,
documentation, human interface, and testing. Detailed
design and coding, and unit test were relegated to
“programmers”. In other words, systems engineers were in
charge of the top level “software design”.

This  interface  was  somewhat  similar  to  the  interface 
between  systems  engineers  and  analog  control  engineers: 
the  analog  control  loops  were  developed  by  the  control 
engineers,  and  then  imposed  on  the  component  developers 
as  requirements. 

The  conclusion  drawn  from  this  discussion  is  that 
during  the  1960s  the  software  was  not  viewed  as  a 
component  on  the  par  with  other  components  --  to  the 
extent  that  software  was  considered  as  a  component,  the 
systems  engineers  were  in  charge. 

2.3 Trends in the late 60s and 70s. During
the  next  decade,  a  number  of  trends  occurred  which  made 
the  interface  between  systems  and  software  engineers  more 
complex. 

First,  as  the  computing  hardware  became  larger,  the 
size and complexity of software became larger. This had
several  predictable  consequences.  The  design  of  the 
software  into  units  developed  by  programmers  became 
increasingly  important  so  that  more  software  developers 
could  be  used,  and  hence  management  of  multiple 
developers  became  a  critical  problem. 


To  cope  with  this,  software  development  was  upgraded 
into  Software  Engineering,  with  emphasis  on  design 
techniques, cost estimation, design for maintainability, etc.
Thus software developers were no longer regarded as
“technicians”, but as a separate discipline, which should be
in charge of the software component. Structured Analysis
techniques, using some version of Data Flow Diagrams,
began  to  be  used  to  discipline  the  software  design  process. 

In  addition,  software  became  a  Configuration  Item 
(i.e.,  a  component),  with  a  software  developer  placed  in 
charge of it. This seemed to place software on the same
par as other components, but there was an important
difference -- the computing hardware was usually regarded
as a separate component, selected by systems and hardware
engineers to satisfy cost, reliability, capacity, and logistics
requirements. Thus the primary issue facing the software
manager was whether all of the software could be developed
on time and fit inside of the pre-selected hardware. This
meant  that  the  software  manager  was  not  a  full  member  of 
the  component  development  team. 

The  interface  that  evolved  between  the  systems  and 
software  engineers  during  this  time  had  some  unfortunate 
characteristics.  Systems  engineers  still  defined  functions 
of  the  system  (organized  by  the  FFBDs),  and  allocated 
some  of  the  functions  to  the  software  component;  and 
these  allocations  were  documented  in  specification 
documents  (e.g.,  MIL-STD  490  B5  Computer  Software 
Configuration  Item).  Software  developers  then  began  their 
software  analysis  from  these  documents.  Thus  the 
information  content  developed  at  the  system  level  (with 
timeline  orientation)  was  reduced  to  textual  testable 
requirements on functions and performance, which were
then translated into a data flow diagram terminology. This
incompatibility between the system and software
descriptions was partially masked by the need for an
intermediate hardcopy document; nonetheless, it is a
problem that has remained into the 1990s.

Finally,  as  systems  complexity  increased,  the  systems 
engineers  abrogated  their  responsibility  to  provide 
complete  requirements  to  the  software  component,  and  the 
software  developers  did  not  pick  it  up.  Although  the 
concepts  were  available  (e.g.,  [ALFO]),  few  software 
specifications  provided  a  complete  set  of  testable 
statements  constraining  the  accuracy  and  response  times  of 
each  required  action  of  the  software  (i.e.,  input,  condition, 
output)  residing  in  its  computer.  The  value  for  such 
requirements  in  such  a  format  has  not  been  recognized  by 
the  majority  of  practicing  software  engineers  even  today. 

In  addition,  the  systems  engineers  abrogated  their 
responsibility  to  define  the  interface  between  the  system 
operator  and  the  software  component  (and  thus  verify  that 
the  human/computer  interface  standards  required  for  training 
were  satisfied).  These  issues  were  deferred  not  just  until 
software  requirements,  but  in  many  cases  past  preliminary 
design  and  even  detailed  design  until  the  coding  phase, 
where they were not reviewed at all. This became the
prime cause for the generally wretched human computer
interfaces exhibited by the generated software.

2.4 Trends in the 80s. During the 1980s, the
capacity of processors continued to increase, and hence
the increase in system complexity got worse. But some new
trends emerged that exacerbated the systems/software
engineering interface problem.

First, as processors got faster and smaller, and
communications capability increased, most embedded
computing systems became distributed. Unfortunately, the
design of the distributed system became an orphan, claimed
by neither the systems engineers, nor the computer
hardware engineers, nor by the software engineers. The
systems engineers did not claim it because it was perceived
as a component design issue; the computer hardware
engineers could not claim it because the critical issue was
the allocation of processing to processors; and the software
engineers were not prepared to deal with the issues of
computer system reliability, logistics, etc.

Secondly, due to advances in computer chip
technology, it became possible to make faster computer
systems by use of specialized computer chips, thus
requiring a hardware-software tradeoff. Again, such
tradeoffs became an orphan: the systems engineers
couldn’t do it because of the low level of detail needed to
perform such tradeoffs, while the computer hardware and
software engineers did not speak a common language
needed to perform the tradeoffs.

Another  consequence  of  the  advances  in  computer 
hardware  was  that  it  became  cost  effective  to  place  high 
performance  engineering  workstations  on  the  desks  of  the 
engineers  to  provide  aids  for  system  and  component 
specification  and  design.  The  electronic  designers  were 
aided  by  the  CAE  tools  (e.g.,  schematic  capture  and 
simulation,  the  schematic  layout  tools).  The  software 
developers  were  aided  by  CASE  tools.  And  finally,  the 
systems  engineers  were  provided  with  system  design 
automation  tools  (e.g.,  interactive  simulators,  automated 
traceability  support,  and  tools  to  support  systems  analysis 
and  design  synthesis). 

As  these  tools  matured,  a  consensus  emerged  that  the 
requirements  and  design  specifications  should  be  executable 
to  eliminate  ambiguities,  incompleteness  and 
inconsistencies.  In  the  CAE  world,  a  standard  executable 
model  emerged  (i.e.,  VHDL).  Unfortunately,  in  the  CASE 
world,  a  plethora  of  tools  emerged  using  a  variety  of 
modeling notations (e.g., Data Flow Diagrams, Control
Flow Diagrams, Petri Nets, SADT/IDEF0 diagrams,
Object  Oriented  requirements  models,  the  Mills  Black  Box 
notation,  etc.).  At  the  system  level,  some  used  various 
simulation  models,  while  others  used  an  FFBD  notation 
extended  into  an  executable  timeline  oriented  behavior 
model.  Yet  the  interface  has  remained  the  same  as  in  the 
1970s:  systems  engineers  model  the  system,  write  paper 
specifications,  then  computer  and  software  engineers 


interpret  the  paper  requirements  to  develop  their  own 
models  to  support  the  next  layer  of  design.  This  interface 
is now even more labor intensive and error prone than
previously. In many cases, the software engineering starts
before the systems engineering has started, much less
produced preliminary outputs for the software developers.

3.  SUMMARY  OF  ISSUES  AND  SOME 
APPROACHES 

Analysis of the above trends gives rise to six general
issues that must be addressed in order to strengthen the
systems/software engineering interface:

1) Where does systems engineering stop and
software engineering begin? Two critical parts of
this issue are:

1a) Who does the human/computer interface? On the
one hand, systems engineering is ultimately responsible
for ensuring that the HCI satisfies the required training
standards. On the other hand, we have painfully learned
that rapid prototyping is necessary to define such interfaces
early in order to estimate the size of the required software.

1b) Who does the distributed design? On the one hand,
a systems engineering job must be done on the distributed
computer component to deal with issues of reliability,
availability, logistics, sizing, etc.; yet much of the fault
tolerance will be implemented in software.

There is a current movement to collect the issues of
distributed design (including capacity, fault tolerance,
reliability, availability, dependability, and
hardware/software tradeoffs) into a Computer Based
Systems Engineering discipline (e.g., see [LAVI]). This
doesn’t completely solve the problem, but designates a
new role and defines a discipline responsible for performing
the distributed and hardware/software tradeoff design of a
computer system in terms of its components (i.e.,
computer hardware components, communication
components, and computer software components).
Although this replaces one interface problem (i.e., systems
engineering/software engineering) with two (systems
engineering to computer systems engineering, and computer
systems engineering to software engineering), I believe
this is the right approach.

2) How detailed is the specification of the
computer system/software? It is a systems
engineering responsibility to define the interactions
between components in sufficient detail that component
developers can develop the components independently.
This argues for an executable specification of the
computer component, and equally argues for an
executable specification of the software component at
the (stimulus, condition, response) level, with performance
requirements on response time and accuracy for specified
arrival rates. Anything less than this is almost guaranteed
to contain ambiguity and/or be incomplete.
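
To make the (stimulus, condition, response) level concrete, the
following Python sketch shows one such requirement written so that
it can be executed against an observed response. It is an
illustration only: the class Requirement, the function satisfied,
the example stimulus "range_request", and all numeric limits are
assumptions introduced here, and arrival rates are not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Requirement:
    stimulus: str                              # input that triggers the action
    condition: Callable[[Dict], bool]          # when the requirement applies
    response: Callable[[Dict], float]          # expected output value
    max_response_time: float                   # seconds (assumed unit)
    tolerance: float                           # allowed absolute error


def satisfied(req, state, observed_value, observed_time):
    """Check one observed response against one requirement."""
    if not req.condition(state):
        return True  # the requirement does not apply in this state
    expected = req.response(state)
    on_time = observed_time <= req.max_response_time
    accurate = abs(observed_value - expected) <= req.tolerance
    return on_time and accurate


# Hypothetical example: report range within 50 ms and 1 m while tracking.
EXAMPLE = Requirement(
    stimulus="range_request",
    condition=lambda s: s["tracking"],
    response=lambda s: s["range_m"],
    max_response_time=0.05,
    tolerance=1.0,
)

if __name__ == "__main__":
    state = {"tracking": True, "range_m": 1200.0}
    print(satisfied(EXAMPLE, state, observed_value=1200.4, observed_time=0.03))
```

Because each requirement names its stimulus, condition, response,
and limits explicitly, a missing limit or an untestable statement
shows up as a field that cannot be filled in.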

This imposes requirements on both sets of
participants: the requirement for the systems engineer to
generate such executable specifications, and the obligation
for the software engineer to demonstrate that they have
been satisfied (i.e., the specified sequences of
stimulus-condition-response behavior have been preserved).
Although such specifications are currently feasible, the
culture of systems engineering is only now beginning to
recognize its obligation.

3) How is the specification information passed:
in a paper specification or by “database” on
electronic media?

The CALS initiative is clearly moving towards
electronic media for capturing relevant requirements/design
information. One emerging view is that the paper
specification should be a “text view” of the “model”
defining component behavior, and that the data base of
allocated behavior should constitute the “real”
specification. Note that unless this is the case, the
transition from systems engineering model to paper to
software requirements model will be both time consuming
and error prone. This gives rise to the next problem.

4) What is the notation of the model passed to
software engineering?

The problem here is that systems engineers have their
notations (e.g., time oriented FFBD notations, and their
extensions), and software engineers have their notations
(e.g., DFD or concurrent state machines). It does no good
to insist that one of them change to using the notation of
the other: they have two different cultures, and cultural
inertia and cost of training suggest that this solution is
infeasible. On the other hand, if two different notations
are used, then the mapping from one to the other must be
automated, or the interface will be time consuming and
error prone, not just for the initial specification, but for
each change thereafter.

A significant problem here is that there is not a single
software engineering notation: Al Davis’ book [DAVIS]
describes a number of different notations in use today (e.g.,
DFD, CFD, Petri Nets, Concurrent State Machines,
StateCharts, object oriented analyses, etc.). Definition of
an automated mapping from system to software
requirements is perhaps the biggest theoretical and practical
problem relating to the interface. An approach to solving
this problem is presented in the next section below.

What  is  the  nature  of  the  dialog  (including  the 
feedback)  between  system  and  software 
engineering?  For  example: 

■ when required processing is allocated to the computer
component (particularly the response times and accuracy of
processing), feedback is required on the feasibility and
cost/schedule implications of such allocations (necessary to
support h/w s/w tradeoffs);

■ when processing requirements are allocated between computer
and operators, a rapid prototype may be required to ensure
that the software behaves as the user expects.

Again,  the  issue  here  is  to  provide  a  rapid  feedback 
mechanism  to  speed  up  the  system  design  process. 
Finally, 


6)  If  the  systems  engineers  provide  an  executable 
specification  of  the  software  component  viewed 
as  a  black  box,  how  will  the  software 
engineering  methods  demonstrate  (or  prove) 
that  the  black  box  behavior  (sequences  of 
stimulus-condition-response)  has  been 
(provably!)  preserved  by  the  software  design? 

Current software development techniques are not
currently oriented towards this problem.

4.  REDUCING  THE  BABEL  OF  NOTATIONS 

As noted above, one of the thorniest problems with
the interface between systems and software engineering
(with or without a CBSE role) is the problem of differing
notations. The philosophy of Ascent Logic Corporation
recognizes that, since systems engineering is the
interdisciplinary engineering discipline, it is the obligation
of the systems engineer to present information in the
language of the component developers; hence it is the
obligation of the systems engineering tools to provide an
automated interface to downstream tools. The approach
developed to address this problem is four fold:

a) system behavior is described using “behavior diagrams
(BDs)”, which are an executable extension of the FFBDs.
An element-relation-attribute-structures data store is used to
keep information about these diagrams, their contents and
traceability;

b) as additional notations are considered, an attempt is made
to perform a mapping from the BDs onto the notation --
the tool data store is extended to capture any additional
information which is not yet represented in the current
notation;

c) an editor is developed to generate the new notation from
the data store using projection (i.e., finding the applicable
subset of information needed to display the notation) and
rules needed to display this information subset in the
notation's syntax; and

d) information is output in the input syntax of the
downstream tool which is used by the downstream
developer. Hopefully, there is a standard for the syntax and
semantics of such information which is accepted by many
different tools (e.g., VHDL for CAE tools, CDIF for
CASE tools).

The working hypothesis is that, since the BDs can be
used to describe the observable behavior of any system or
component viewed as a black box, then there should be a
mapping onto other notations which are used to describe
the same “black box”. So far, this hypothesis has
withstood the test of a number of other notations. After a
presentation of the Behavior Diagram notation, an
overview of the mappings to a representative set of
notations is presented -- the mappings onto all known
notations would surpass the page limitations of this paper.

4.1 Behavior Diagrams. The behavior diagram
notation was developed by merging the concepts of the
systems engineering Functional Flow Block Diagram (i.e.,
functions, sequences, selections, concurrencies, timelines),
flow notations (i.e., flows of items between functions, as
in IDEF0 or Data Flow Diagrams), Graph Models of
Computation (i.e., showing multiple exits of a function),
Hierarchy of control concepts (i.e., defining replicated
concurrent functions), and explicit labeling of function exit
conditions and performance. The result was an executable
notation which could be used to precisely define the
intended behavior of a system. It was then augmented to
describe interface designs and fault detection/recovery.

The foundation of the behavior diagram notation is the
concept of DiscreteItems processed by DiscreteFunctions.
A DiscreteItem may have contents, but arrives logically as
a unit at an identifiable moment in time. The DiscreteItem
can be used to represent either a physical thing (e.g., a
peach) or a set of data. Some DiscreteItems (called state
items) contain the partially processed results of previous
functions operating on previous input items, and are passed
down to subsequent functions for use in processing future
arriving items. The DiscreteItems entering or exiting the
system boundary are by definition the “observables” of the
system. A DiscreteItem is represented on a graph by a
shaded oval containing its name.

Exhibit 3 presents a DiscreteFunction, represented
graphically by a shaded rectangle. Time flows from top to
bottom, so the line at the top of the DiscreteFunction
carries the enablement from a previous function. When
enabled, the DiscreteFunction waits for the arrival of a
DiscreteItem -- in the diagram below, only one
DiscreteItem A is shown, but more than one is possible.
When the first item arrives, any combination of the
output items (e.g., B and/or C) can be generated (including
the state item S2), and one of its exits is taken (which
enables some subsequent function), after a finite duration.
If a non-designated item arrives, this is assumed to be an
error (e.g., input out of sequence). The exits are labeled
with the names of the conditions (e.g., C1 and C2) to be
met to take the exit. Exits can be classified as normal,
exceptional, or “timeout” (which is taken if no input
arrives within a designated period of time). The
DiscreteFunction can also be defined to require a designated
resource amount, and if this is not available when the
input item arrives, the function will wait for the resource
availability.


Exhibit  3  Example  Discrete  Function 
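
The enable/wait/respond/exit cycle described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the RDD-100 implementation: the class name, the `step` method, the `respond` callback, and the choice to raise an exception for a non-designated item are all hypothetical, and the resource wait is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple


@dataclass
class DiscreteFunction:
    """Hypothetical model of one DiscreteFunction (all names are illustrative)."""
    name: str
    accepted: Set[str]            # designated input item names
    duration: float               # finite execution time
    timeout: Optional[float]      # None means "wait indefinitely"
    # respond(item, state) -> (output item names, new state item, exit label)
    respond: Callable[[str, dict], Tuple[List[str], dict, str]]

    def step(self, enabled_at, arrival, state):
        """arrival is (item_name, time) or None; returns (outputs, state, exit, exit_time)."""
        deadline = None if self.timeout is None else enabled_at + self.timeout
        if deadline is not None and (arrival is None or arrival[1] > deadline):
            return [], state, "timeout", deadline      # timeout exit, no outputs
        if arrival is None:
            raise ValueError("no input arrived and no timeout is defined")
        item, arrived_at = arrival
        if item not in self.accepted:
            # Assumption: a non-designated item is reported as an error.
            raise ValueError(f"non-designated item {item!r}")
        outputs, new_state, exit_label = self.respond(item, state)
        return outputs, new_state, exit_label, max(enabled_at, arrived_at) + self.duration


def respond(item, state):
    """Illustrative response: exit C1 for the first two items, C2 afterwards."""
    count = state.get("count", 0) + 1
    if count < 3:
        return ["B"], {"count": count}, "C1"
    return ["B", "C"], {"count": count}, "C2"


f = DiscreteFunction("F", {"A"}, duration=2.0, timeout=10.0, respond=respond)
result = f.step(enabled_at=0.0, arrival=("A", 4.0), state={"count": 2})
timed_out = f.step(enabled_at=0.0, arrival=None, state={})
```

The state item is passed in and returned explicitly, which mirrors the way state items travel between functions in the notation rather than being hidden inside the function.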


A DiscreteFunction can be decomposed into a “Response
Net”, or RNet, to define how the contents of the input
items are used to determine conditions under which each
output item and condition is selected, and are used to
generate the contents of the output items. The definition
of a DiscreteFunction is a mild extension of the “function”
of a finite state machine -- it is allowed to generate
combinations of outputs (not just a single output), its
execution may require the availability of a designated
resource, and the state items (e.g., containing state
information) are explicitly defined. The RNets and their
contents are equivalent to a completed decision table
defining the stimulus-condition-response of a function. An
example of an RNet is shown below, which accepts A, B,
or C and outputs either X and Y or Z or nothing
(depending on the arrival and value of a condition CC1),
then generates the state item and selects the appropriate
exit.


Exhibit  4  Example  Response  Net 

Sequences of Functions are described with a Function Net,
or FNet, containing functions, as shown in Exhibit 5.
Time flows from top to bottom (indicating arrival of items
to be processed) and left to right (reflecting inputs
transformed into outputs). Looking at the black box
boundary of Exhibit 5, note that first a Peach arrives, then
a Can; that first a peach Skin, then a Pit, and then Canned
Slices exit the boundary -- these are all explicitly
observable. The processing is described as a sequence of
three functions. Note that state items pass from “Skin a
Peach” to “Slice a Peach” and from “Slice a Peach” to
“Can a Peach”, and since they are “inside the box” they are
not observable. The Function “Can a Peach” cannot
execute until the “Can” arrives.


Exhibit  5  Example  Sequence 


This  notation  is  executable,  i.e.,  it  can  be  executed  to 
yield  the  times  of  the  outputs  from  the  times  of  the  inputs 
and the durations of the functions. The figure below
presents  a  sample  timeline.  Note  that  the  empty  bar 
indicates  a  period  of  time  when  a  function  is  enabled  but 
waiting  for  an  input,  and  a  dark  filled  bar  indicates  the 
duration  required  for  the  execution  of  the  function. 
Outputs  are  available  when  the  function  completes. 


[Timeline figure. Skin a Peach: enabled, waits for the Peach,
produces and emits the Skin and Skinned Peach, then exits and
enables Slice a Peach. Slice a Peach: produces and emits the
Pit and Sliced Peach, then exits and enables Can a Peach.
Can a Peach: waits for the Can, cans the peach, and emits
Canned Slices.]


Exhibit  6  —  Sample  Timeline 
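
The claim that the notation can be executed to yield output times from input times and function durations can be illustrated with a short sketch. The function `run_sequence`, the arrival times, and the durations below are hypothetical values chosen for illustration; only the three function names and the Peach and Can items come from the example above.

```python
def run_sequence(functions, arrivals, start=0.0):
    """Execute a simple sequence of functions.

    functions: list of (name, triggering external item or None, duration).
    arrivals:  maps an external item name to its arrival time.
    A function with trigger None runs on the state item handed down by its
    predecessor, so it starts as soon as it is enabled.
    Returns a list of (name, enabled_at, started_at, completed_at).
    """
    timeline = []
    enabled = start
    for name, trigger, duration in functions:
        # Waiting period: enabled, but the external input has not arrived yet.
        begin = enabled if trigger is None else max(enabled, arrivals[trigger])
        end = begin + duration          # outputs become available on completion
        timeline.append((name, enabled, begin, end))
        enabled = end                   # exit enables the next function
    return timeline


# Hypothetical times: the Peach arrives at t=2 and the Can at t=12.
peach_line = [
    ("Skin a Peach", "Peach", 3.0),
    ("Slice a Peach", None, 4.0),
    ("Can a Peach", "Can", 1.0),
]
timeline = run_sequence(peach_line, {"Peach": 2.0, "Can": 12.0})
for name, enabled_at, started_at, completed_at in timeline:
    print(f"{name}: waits {started_at - enabled_at}, runs {completed_at - started_at}")
```

The difference between `enabled_at` and `started_at` corresponds to the empty bar in the timeline, and the difference between `started_at` and `completed_at` to the filled bar.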


The  functions  can  be  placed  into  graphs  containing  not 
just  sequences,  but  selections,  iterations,  loops, 
concurrencies,  and  replications  as  well.  The  notation  for 
these  graphic  constructs  appears  below. 


Exhibit  7  Behavior  Graph  Constructs. 


The replication construct requires a special word of
explanation. One defines the domain of replication (e.g.,
for each aircraft in track, for each user of the system), and
then defines one function per replicate. In this way, users
do not have to deal explicitly with “indices of functions” in
order to define them. Finally, a coordination function is
defined, which accepts status from the replicates and
generates controls back to them; the coordination function
is responsible for detecting and resolving all conflicts
between the replicates (e.g., two aircraft collide, or giving
priority to certain users if there are insufficient resources to
service all).

To deal with large models, a graph of functions can be
aggregated into a “TimeFunction”, i.e., a function which
inputs and outputs specified sequences of items.
Functional Decomposition then reverses the aggregation
process, defining a graph of functions which preserve
specific properties of the original function (i.e.,
input/output content and sequence, number and kinds of
exits, and ability to calculate the performance of the parent
function from the performance indices of the functions on
the graph). The aggregation or decomposition procedure
can be recursive to organize a graph of arbitrary size and
complexity into a hierarchy of functions and their
decompositions to support understandability.


Exhibit 8 Allocation of Functions to
Components

When the desired behavior of a system is defined,
functions are allocated to the system components.
Exhibit 8 illustrates this process. The black box
behavior of the system is decomposed to the level that
functions can be partitioned and allocated to the
components of a postulated architecture (e.g., C1, C2 and
C3, shown in the upper right). Note that the allocation
yields the requirements for a new interface function labeled
IF, which is decomposed and allocated between sender and
receiver (and possibly a communications component).
This process can be recursively applied to yield layers of
interface design. The resulting allocated functions,
including those implementing the interface design, then are
extracted to yield the black box behavior of a component.
The extraction process can be implemented by projection
operators.
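
As a rough illustration of extraction by projection, the sketch below selects the functions allocated to one component and derives the items that cross that component's boundary. It is a simplification under stated assumptions: the data layout, the `project` function, and the example names F1 to F3 and ENV are hypothetical, and the decomposition of the interface function IF between sender and receiver is not modeled.

```python
def project(allocation, flows, component):
    """Project the allocated model onto one component.

    allocation: maps function name -> component name.
    flows:      list of (source function, item, destination function).
    Returns (functions, boundary inputs, boundary outputs) for the component.
    """
    funcs = sorted(f for f, c in allocation.items() if c == component)
    inside = set(funcs)
    # Items entering from outside the component are its observable inputs.
    inputs = sorted({item for src, item, dst in flows
                     if dst in inside and src not in inside})
    # Items leaving the component are its observable outputs.
    outputs = sorted({item for src, item, dst in flows
                      if src in inside and dst not in inside})
    return funcs, inputs, outputs


# Hypothetical allocation: F1 and F2 on component C1, F3 on component C2.
allocation = {"F1": "C1", "F2": "C1", "F3": "C2"}
flows = [("ENV", "a", "F1"), ("F1", "s", "F2"), ("F2", "m", "F3"), ("F3", "z", "ENV")]

c1_view = project(allocation, flows, "C1")
c2_view = project(allocation, flows, "C2")
```

Item `s` does not appear in the C1 view because it stays inside the component, which matches the idea that only boundary items are observable.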

The  RDD  notation  thus  provides  the  system  designer 
with  the  ability  to  define  an  executable  description  of  the 
desired  behavior,  and  its  allocation  to  components.  It 
supports  separation  of  concerns  (e.g.,  separation  of  normal 
from  exceptional  behavior,  single  object  behavior  from 
behavior  to  coordinate  concurrent  functionality,  normal 
from  interface  behavior). 


4.2 Systems Engineering Notations
(FFBDs, N-Squared charts, IDEF0). The BD
notation is mapped onto the Systems Engineering
notations in order to present practicing systems engineers
information in a notation in which they were trained.
Although the BD notation was synthesized from the
FFBDs and other notations, it is useful to actually
construct publishable FFBDs from the BDs. For example,
the FFBD in Exhibit 2 was created from a BD
automatically by eliminating the inputs, outputs, and
conditions, and displaying the resulting function sequences
and concurrencies in a left to right format.

A corresponding N-Squared chart is constructed by
eliminating the function sequence, concurrency and
conditions, and external inputs and outputs. The resulting
functions and their flows are represented by placing the
functions on the diagonal of a matrix, then placing a circle
denoting flow on the (i,j) matrix element to represent the
flow from function i to function j. A publishable
N-Squared chart is presented in Exhibit 9.

Many systems engineers from the manufacturing area
use the IDEF0 flow notation (a variant of the SADT
notation developed by Doug Ross, see [ROSS]). This
notation describes essentially the same information as the
N-Squared chart, but describes external inputs and outputs
as well as internal. The functions are arranged on a
diagonal as with the N-Squared Chart, but labeled arrows
are used to describe flows. In addition, if a flow is
designated as “data”, it enters the side of a function box; if
the flow is designated as “primarily a control”, then it is
shown to enter the top of a function box. These are
generated from BDs by ignoring the
sequence-conditions-iteration-concurrency-replication
information, defaulting the flows to the “data” mode, and
allowing the user to later designate a flow as a “control
flow”. Exhibit 10 presents the IDEF0 diagram that results
from the application of this procedure.


Exhibit  9  Example  N-Squared  Chart 
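
The N-Squared construction is essentially a projection onto a matrix, which a few lines of Python can show. The function `n_squared`, the function names F1 to F3, and the use of `"o"` for the flow circle are illustrative assumptions, not the output format of any tool.

```python
def n_squared(functions, flows):
    """Build an N-Squared chart as a list of rows.

    functions: ordered list of function names (placed on the diagonal).
    flows:     list of (source, destination) pairs; pairs that involve
               anything outside `functions` (external inputs and outputs)
               are dropped, as in the construction described in the text.
    """
    index = {name: i for i, name in enumerate(functions)}
    n = len(functions)
    grid = [["." for _ in range(n)] for _ in range(n)]
    for name, i in index.items():
        grid[i][i] = name                       # functions on the diagonal
    for src, dst in flows:
        if src in index and dst in index and src != dst:
            grid[index[src]][index[dst]] = "o"  # flow from function i to function j
    return grid


chart = n_squared(
    ["F1", "F2", "F3"],
    [("F1", "F2"), ("F2", "F3"), ("F3", "F1"), ("ENV", "F1")],
)
for row in chart:
    print("  ".join(f"{cell:>2}" for cell in row))
```

Marks above the diagonal are forward flows and marks below it are feedback flows, which is the usual way such a chart is read.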

These transformations are performed automatically by the
RDD-100 System Designer using an interactive editor for
each type of diagram. When the flows are modified in one
diagram, the flows in all other diagrams are updated when
selected, so that the different views of the information in
the data store cannot be “out of synchronization”.


Exhibit 10 Example IDEF0 Diagram


4.3 Mapping to Petri Nets. I consider Petri
Nets to be the assembly language of behavior. Simple
Petri Nets are used to describe sequence, selections, and
concurrency as shown below.


Exhibit 11 Petri Net Constructs


Sequence is represented as follows: a “token” is
placed into Place 1; at some point, a transition will
“fire”, resulting in removing the token from Place 1 and
inserting a token into Place 2. Selection is indicated by
the transition removing a token from Place 3 and inserting
a token into either Place 4 or Place 5. Concurrency is
indicated by removing a token from Place 6, and inserting a
token into both Place 7 and Place 8. Thus one transition
symbol indicates a selection, while the other indicates that
all exiting places will be filled. Rejoins of selections and
concurrency are indicated as shown above.

For a selection rejoin (indicated by a “+” on the upper
side of a transition), if there is a token in either Place 1 or
Place 2, when the transition “fires”, the token is removed
and a token is inserted into Place 3. For a concurrency
rejoin (which is actually a synchronization point), when
there are tokens in both Places 4 and 5, a transition can fire
which removes both tokens and inserts a token into Place
6.
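
The token game described above can be sketched as a minimal place/transition net. Two points are assumptions rather than something taken from the exhibit: selection is modeled in the classical way, as two transitions competing for the same token, instead of one transition with alternative outputs; and the place and transition names (`p1`, `fork`, `join`, and so on) are hypothetical.

```python
from collections import Counter


class PetriNet:
    """Minimal place/transition net: a marking plus named transitions."""

    def __init__(self, marking):
        self.marking = Counter(marking)     # place name -> token count
        self.transitions = {}               # name -> (input places, output places)

    def add(self, name, inputs, outputs):
        self.transitions[name] = (list(inputs), list(outputs))

    def enabled(self, name):
        inputs, _ = self.transitions[name]
        needed = Counter(inputs)
        return all(self.marking[place] >= count for place, count in needed.items())

    def fire(self, name):
        if not self.enabled(name):
            raise ValueError(f"transition {name!r} is not enabled")
        inputs, outputs = self.transitions[name]
        for place in inputs:
            self.marking[place] -= 1        # remove a token from each input place
        for place in outputs:
            self.marking[place] += 1        # insert a token into each output place


net = PetriNet({"p1": 1, "p3": 1, "p6": 1})
net.add("seq", ["p1"], ["p2"])              # sequence
net.add("choose4", ["p3"], ["p4"])          # selection: only one of the two
net.add("choose5", ["p3"], ["p5"])          #   can consume the token in p3
net.add("fork", ["p6"], ["p7", "p8"])       # concurrency: both outputs filled
net.add("join", ["p7", "p8"], ["p9"])       # synchronization: needs both tokens

net.fire("seq")
net.fire("choose4")
net.fire("fork")
join_ready = net.enabled("join")
net.fire("join")
```

After `choose4` fires, `choose5` is no longer enabled, which is what makes the pair behave as a selection.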


Colored  Hierarchical  Timed  Petri  Nets 
Colored  Hierarchical  Timed  Petri  Nets  are  an  extension 
of  Petri  Nets  in  which: 

• the tokens can be specified to be of a specific type, and to
contain a specific subset of “data” (e.g., an index);

• the “places” are “bags”, i.e., can contain many different
tokens;

• the transitions can be specified to take into account the
token types and values -- this allows one to have many
different tokens on a graph to represent processing of
multiple objects, to specify transitions as occurring only
with tokens with the same index, and to specify the
transitions in terms of their transformations for mapping
the input tokens into the values of the output tokens;

• a fragment of a Petri Net can be aggregated into a “place”
which preserves the input and output arcs of the fragment.
This allows a large problem to be described using a
hierarchy of Petri Nets.


Consider the DiscreteFunction shown in Exhibit 3.
One can describe this DiscreteFunction as having four
phases: the enablement phase (i.e., arrival of function
enablement and state item); the triggering phase, when
one of the discrete items A, B, or C arrives; the calculation
phase, in which some combination of the outputs X, Y, and Z
is generated; and finally the exit phase, when one of C1 or
C2 is taken (this example ignores the resource, which
could be shown as an additional branch).

This DiscreteFunction F can be represented by the
Petri Net shown below. The enablement phase is modeled
by the arrival of a token at the ENTRY and of the token
carrying the state information; a token would thus be
placed at Place 2. A colored token could be used to
represent the arrival of DiscreteItems A, B, or C, and when
the transition fired, its content would be inserted into Place
1, and thus the transition after Places 1 and 2 could fire,
resulting in a token being inserted into Place 3 (and all
those to the right of it). The three concurrent Petri Net
branches model the generation of the outputs (only one of
these is numbered for simplicity). The transition after
Place 3 determines whether an output should be generated,
and places a token in either Place 4 (no output is to be
generated) or Place 5 (the output X is to be generated).
The transition after Place 5 generates the output token X,
and places a token in Place 6; in either case, a token is
now placed at Place 7. When all of the branches for the
calculation of X, Y, and Z are completed, then the
transition after Place 7 determines whether a token is to be
inserted into Place 8 or Place 9, resulting in the output of
a token corresponding to either condition C1 or C2.


Exhibit  12  Example  Petri  Net  of  a  Discrete 
Function 


• a transition can be specified to take a specific amount of
time, and can also be used to specify a “timeout”.

The mapping of the BD notation onto Colored Petri
Nets is fairly straightforward, and can be done in two parts.
First, the DiscreteFunctions are mapped onto a Petri Net
fragment; then the BD constructs of sequence, selection,
concurrency, iteration and replication are mapped onto Petri
Net fragments.


Thus a DiscreteFunction can always be exactly
modeled by a Petri Net with input places corresponding to
each of its input Discrete and State Items and its input arc,
and output places corresponding to the output Discrete and
State Items and exit arcs. In a similar fashion, all of the
major graphic constructs can be implemented using Petri
Nets. The result is that any Behavior Diagram can be
mechanically translated into a single large Petri Net, or
onto a Hierarchical Petri Net. A consequence of this
mapping is that many of the proofs about Hierarchical
Colored Timed Petri Nets can be applied to Behavior
Diagrams.

The reverse mapping is much more difficult, because
the user may not have imposed the same sort of regular
structure on the Petri Net as do the DiscreteFunctions of
the Behavior Diagrams. This is analogous to the difference
between structured code in a higher order language mapped
onto assembly language -- the higher order structuring is
projected out when the mapping occurs, and it is quite
difficult to recreate the higher order structuring from only
the assembly code.

4.4 Mapping to Data Flow Diagram Notations.
The Data Flow Diagram (DFD) was originally developed to
identify requirements for non-real time software systems. The
notation is similar to IDEF0 and N-Squared charts in that
it displays the flows between functions. However, a DFD
contains an additional construct -- the “Data Store”, defined
to contain “state information”. The Data Store is defined
in Data Dictionaries as “containing” a list of flows (i.e.,
those input to and output from it). Thus the mapping of
BDs to DFDs requires the identification of Data Stores, and
the establishment of a “contained by” relationship between
each state item and the Data Store to which it is assigned.
Note that this mapping is not unique, but can be made to be
complete (i.e., every state item is contained in some Data
Store). With these definitions, a Data Flow Diagram can
be constructed from a BD by:

• defining a DFD process for each BD Function;
• assigning state items to Data Stores; and
• representing the input/output relationships as labeled
arrows between the processes and/or Data Stores.

A set of “Control Flows” between the processes can
be obtained from the BD graph by representing each
enablement between functions as a Control Flow between
processes. Examples of such mappings are presented
below.
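
The construction steps above can be sketched as a small transformation. The function `bd_to_dfd`, the dictionary layout of its result, and the store name `DefaultStore` are hypothetical; the sketch assumes that any state item without an explicit assignment goes to a single default Data Store, and it reuses the peach example only for illustration.

```python
def bd_to_dfd(functions, flows, state_items, enablements, store_of=None):
    """Derive data and control flow diagrams from a flattened BD description.

    functions:   list of BD function names (one DFD process each).
    flows:       list of (source, item, destination) input/output relationships.
    state_items: set of item names that are state items.
    enablements: list of (function, next function) enablement pairs.
    store_of:    optional map from state item to Data Store name.
    """
    store_of = store_of or {}
    data_flows = []
    for src, item, dst in flows:
        if item in state_items:
            # A state item is written to, then read from, its Data Store.
            store = store_of.get(item, "DefaultStore")
            data_flows.append((src, item, store))
            data_flows.append((store, item, dst))
        else:
            data_flows.append((src, item, dst))
    stores = sorted({store_of.get(item, "DefaultStore") for item in state_items})
    return {
        "processes": list(functions),
        "stores": stores,
        "data_flows": data_flows,
        "control_flows": list(enablements),   # one Control Flow per enablement
    }


dfd = bd_to_dfd(
    functions=["Skin a Peach", "Slice a Peach", "Can a Peach"],
    flows=[
        ("ENV", "Peach", "Skin a Peach"),
        ("Skin a Peach", "Skinned Peach", "Slice a Peach"),
        ("Slice a Peach", "Sliced Peach", "Can a Peach"),
        ("ENV", "Can", "Can a Peach"),
        ("Can a Peach", "Canned Slices", "ENV"),
    ],
    state_items={"Skinned Peach", "Sliced Peach"},
    enablements=[("Skin a Peach", "Slice a Peach"), ("Slice a Peach", "Can a Peach")],
)
```

Note what the result no longer carries: the order in which items arrive and the conditions attached to each enablement are gone, so the output cannot be mapped back to the original diagram without extra information.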


Exhibit  12  Example  Data  and  Control  Flow 
Diagrams 


The BD information not represented on these diagrams
includes the identification of the “observable” items input
to and output from the system boundary (although this
might be represented in textual descriptions of the flows),
the expected sequences of inputs, functions, and outputs,
the replication of functions, and the conditions under which
the control flows are generated. The mapping of the
conditions onto the control flows is imprecise because of
the replications. Hence the reverse mapping onto BDs
requires the addition of the replication, conditions, and
sequencing information projected out on the forward
mapping. No automated mapping seems possible because
of the missing information.

If executable descriptions are added at the bottom level
(e.g., with Petri Nets as is done in ADAS), then the data
flow model is transformed into a set of interacting
concurrent state machines, described next.

4.5 Mapping to Interacting Concurrent Finite
State Machines (CFSM). There is a large class of
models which represent a system as a collection of
Interacting Concurrent Finite State Machines (e.g., VHDL,
DFD/ADAS, CCITT SDL, object models). The key to
mapping a BD model onto this style of model lies in the
fact that, when every TimeFunction is replaced by its
decomposition Behavior Diagram, the result is a very large
graph containing a number of concurrent branches. Every
concurrent branch is called an RDDProcess, which satisfies
the definition of a Terminating Finite State Machine.
Every RDDProcess is enabled by some other process, and
receives items from and sends items to other
RDDProcesses.

To turn this graph into a collection of Concurrent
FSMs requires that each RDDProcess be mapped onto an
FSM “component”. This yields one FSM for each
replicated RDDProcess. The interface design is then
constructed to represent the passing of enablements along
the BD Graph of RDDProcesses as the passing of
messages between the resulting FSMs. This means that
each RDDProcess, when terminating in the start of a
concurrency, would “send an enablement message” to the
indicated FSMs; the FSMs would be augmented with a
function to accept the enablement message to get started.
An example of this process is presented in the figure
below.


Exhibit 13 Example Concurrent Processes


This mapping has been prototyped for VHDL. Each
arriving item is modeled in VHDL as a “signal”, and the
invocation of each DiscreteFunction in an RDDProcess is
equivalent to the “call” of a procedure in VHDL. Thus the
RDDProcesses map to VHDL processes, and the sequence
of functions in RDD maps to a sequence of functions in
VHDL.

The information lost by this mapping is the hierarchy
itself (which provides the reader with the ability to
understand the intended sequences of actions), the expected
sequences of inputs, and the identification of the expected
sequences of normal processing.

The loss of the sequencing information turns out to be
crucial. Even though the sequences of externally
observable inputs and outputs can be recreated by hitting
the collection of FSMs with a number of “test vectors”,
this process is extremely time consuming. Moreover, if
each of only 20 FSMs has but 10 states, then the
dimensionality of the states of the combined state
machines is on the order of 10**20 -- far too large to
systematically explore in a limited amount of time (and the
dimensionality of 200 FSMs is on the order of 10**200).
In its original hierarchy form, most of these states are
observed to be impossible (i.e., a state of an FSM is not
meaningful until it is enabled).
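
The size of the combined state space follows directly from multiplying the state counts of independent machines. The short calculation below makes the scale concrete; the exploration rate of one billion joint states per second is a hypothetical figure chosen only to give a sense of proportion.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def joint_states(num_machines, states_each):
    """Upper bound on joint states of independent machines: states_each ** num_machines."""
    return states_each ** num_machines


def years_to_enumerate(total_states, states_per_second):
    """Time needed to visit every joint state once at a fixed rate."""
    return total_states / (states_per_second * SECONDS_PER_YEAR)


small = joint_states(20, 10)       # 20 FSMs with 10 states each
large = joint_states(200, 10)      # 200 FSMs with 10 states each

# Assumed rate: 10**9 joint states examined per second.
years = years_to_enumerate(small, 10**9)
print(f"20 machines: 10**{len(str(small)) - 1} joint states, about {years:,.0f} years")
print(f"200 machines: 10**{len(str(large)) - 1} joint states")
```

Even at that optimistic rate the 20-machine case takes thousands of years to enumerate, which is why the enablement structure that rules out most joint states matters so much.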

Thus the reverse mapping (from CFSMs to BDs)
requires the effort of recreating the expected sequences of
inputs and functions, and no automated technique is known
to exist.

4.6 Mapping to a Single Extended State
Machine (SREM, Mills box notation). Both the
SREM and the Mills box notations are variants of
descriptions of a single state machine representation of a
system. In both cases, the system is described as having a
single state, and the arrival of an input item yields the
generation of an output item and a transition to some next
state. The state machine is extended to allow an output
“event” to become a subsequent input item. It is expected
that an input will be completely processed before the next
input is accepted.

Mapping the BD onto a single state machine requires
several transformations. First, the set of concurrent state
machines is obtained by recursively substituting BDs for
functions and hence eliminating the hierarchy constructs.
Next, the concurrent state machines are “serialized” by the
addition of an external serializing function which does not
allow the next input to arrive until the previous input has
been completely processed.

The  multiple  state  machines  (and  their  interactions) 
could  be  viewed  using  various  object  notations.  Note  that 
each  concurrent  branch  of  behavior  has  given  rise  to  a  state 
machine  which  encapsulates  the  state  information  passed 
down  the  branch  of  processing. 

Each  of  the  multiple  state  machines  can  be  collapsed 
from  a  Moore  model  into  a  Mealy  model  by  “stateizing” 
the  location  of  the  token  indicating  which  function  is 
active.  This  is  equivalent  to  adding  a  variable  with 
enumerated  values  (i.e.,  the  names  of  the  active  functions), 
then  adding  an  initial  function  which  accepts  the  input, 
determines  the  current  state,  then  activates  the  appropriate 
function  with  appropriate  state  information. 
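
The “stateizing” step can be pictured as adding an enumerated variable for the active function plus an initial dispatch function. The sketch below is illustrative only: the enumeration `Active`, the functions `f1` to `f3`, the `dispatch` function, and the input and output names are all hypothetical.

```python
from enum import Enum


class Active(Enum):
    """Enumerated variable naming the function that is currently active."""
    F1 = "F1"
    F2 = "F2"
    F3 = "F3"


# Each function consumes one input, updates the state information, and
# returns (output items, name of the function that becomes active next).
def f1(item, state):
    state["seen"].append(item)
    return ["X"], Active.F2


def f2(item, state):
    state["seen"].append(item)
    return ["Y"], Active.F3


def f3(item, state):
    state["seen"].append(item)
    return ["Z"], Active.F1


HANDLERS = {Active.F1: f1, Active.F2: f2, Active.F3: f3}


def dispatch(item, state):
    """Initial function: accept the input, look up the current state, and
    activate the appropriate function with the state information."""
    handler = HANDLERS[state["active"]]
    outputs, next_active = handler(item, state)
    state["active"] = next_active     # the token location is now ordinary state
    return outputs


state = {"active": Active.F1, "seen": []}
first = dispatch("a", state)
second = dispatch("b", state)
```

The position of the control token has become an ordinary data value, so the sequence that was visible in the graph is now implicit in how `state["active"]` is updated.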


All of the multiple state machines can then be
collapsed into a single state machine with partitioned states
(i.e., one partition per state machine). If many replicates
are collapsed into a single state machine, then the interface
function must determine which replicate is appropriate to
the input message, and invoke the appropriate function.
This is equivalent to the SREM notation. A “box”
notation of Mills can be similarly obtained.

The reverse mappings share the same problems as the
concurrent state machine notations. One must hit the
single state machine with a number of test vectors to
“unfold” the processing into the intended sequences of
functions, and then in addition identify the allowed
concurrency of operations.

4.7 Conclusion -- Automated Mappings
Are Possible, and Mandatory. The primary
conclusion drawn from the above is that it is possible to
provide automated support for the mappings from the RDD
systems engineering notation into the various software
engineering notations. Much of the transformation is
automated; some mappings require the addition of
notation-peculiar information (e.g., identification of
“control flows” in IDEF0, identification of “data stores”
for DFDs), but some defaults can be automatically supplied
(e.g., all flows are “data flows” for IDEF0, all state items
contained in an RDDProcess are assigned to a default “data
store”). The availability of such mappings strongly suggests
that the use of RDD at systems engineering time provides a
solution to the “Babel of Notations” problem, i.e., any of
these notations can be obtained by automated means.

The reverse mappings appear to be much more
difficult, requiring the addition of significant amounts of
information not contained within the software engineering
notations. In fact, the reverse mappings for the Petri Nets
and the concurrent finite state machines require the
re-creation of the structuring information -- a notoriously
difficult task.


5.0  DISCUSSION 

Although a smooth interface is required from Systems
to Software engineering, the way in which the two
disciplines have grown in the past 25 years has not
satisfied this requirement. A number of problems with the
interface have been identified, and some approaches for
solutions have been discussed.

The mapping of the RDD Systems Engineering
notation onto several software engineering notations
suggests a solution to the "Babel of notations" problem. It
also provides some valuable insights into the benefits and
drawbacks of these software engineering notations (e.g.,
the lack of hierarchy, the lack of support for a separation of
concerns of normal from exceptional behavior, and the lack of
representation of desired sequences of functions may
provide some explanations for why the specification of
required software behavior has been so notoriously
difficult).

Several open issues have also been identified: the
necessity for a CBSE discipline to perform all of the
functions of distributed design, and the necessity for
software development methods to demonstrate preservation
of allocated executable black box level behavior. It is
suggested that software design methods which
describe constructive steps for allocating required
processing onto units of code (and simultaneously prove
that the behavior has been preserved) may provide a
possible solution to this problem.

Abstract

This paper describes START/ES, an expert system based tool for performance and reliability analysis
of complex computer systems. START/ES provides an iconic system design capture interface, which
allows direct manipulation of system design attributes and facilitates the exploration of a wide range of
design alternatives. System design descriptions are automatically translated into mathematical models,
which evaluate candidate designs in terms of both performance and reliability. Evaluation results can
be analyzed by the automated reasoning component of the tool, which uses a rule-based approach to
diagnose performance and reliability problems, and to recommend design changes for achieving
system designs which are compliant with all requirements. Finally, the rule base is extensible and can
be modified using START/ES's rule builder interface.

1.  Introduction 

Performance  and  reliability  are  key  aspects  of  system  effectiveness  which  must  be  considered  during 
the  design  of  mission-critical  computing  systems.  Automated  tools  are  needed  to  address  the 
complexity  of  design  alternatives,  and  to  provide  quantitative  evaluation  of  system  performance  and 
reliability  characteristics.  The  complexity  of  the  interactions  between  design  attributes  is  such  that 
automated assistance is helpful in interpreting evaluation results, and in exploring the implications of
alternative  design  strategies. 

START/ES (for System Timing And Reliability Tool - Expert System) is an expert system based
automated  tool  that  addresses  design  verification  of  mission-critical  computer  systems.  This  paper 
summarizes  the  tool's  capabilities  for  describing,  evaluating,  and  analyzing  systems. 

The  paper  is  organized  as  follows.  Section  2  provides  an  overview  of  the  primary  components  of 
START/ES.  Section  3  summarizes  the  elements  of  START/ES  system  design  representations.  Section 
4 describes the system design evaluation results produced by START/ES performance and reliability
models.  Section  5  discusses  the  START/ES  expert  system  component,  and  provides  examples  of  the 
types  of  design  assistance  produced  at  various  stages  of  the  automated  reasoning  process.  Section  6 
describes  the  START/ES  rule  builder  interface  which  allows  experts  to  modify  the  automated 
reasoning  capability. 


2.  START/ES  Overview 

Figure 1 shows the relationship between the major components of START/ES. A number of different
interfaces  are  provided  to  allow  interaction  with  the  tool  by  both  system  designers  and  by  performance 
and reliability experts.


Figure 1: START/ES Components


The system description data base contains information related to the attributes and organization of
hardware, software, and functional elements of a candidate system design. This information is entered
using the graphical design capture interface originally developed for the START tool for integrated
performance and reliability analysis [1].

The  performance  and  reliability  models  provide  analytic  computational  capabilities  which  transform 
system  descriptions  into  quantitative  indicators  of  system  reliability  and  performance.  These 
computational  capabilities  were  also  originally  developed  for  the  START  tool.  Top-level  results 
produced  by  the  models  are  displayed  to  the  user  as  annotations  on  system  description  drawings. 

The expert system component provides design assistance using expert system technology provided by
the "C" Language Integrated Production System (CLIPS) shell [2] embedded within START/ES. The fact base
containing system description and evaluation results data is queried during execution of the CLIPS
inference engine, which exercises the automated reasoning process encoded in the rule base.

The  expert  system  rule  base  is  partitioned  into  three  rule  sets,  each  concerned  with  providing  a 
particular type of design assistance:

•  Compliance  assessment  rules  determine  whether  a  candidate  system  design  meets  its 
requirements.  These  rules  compare  top-level  performance  and  reliability  results  against  the 
corresponding  requirement  values  specified  in  the  system  description. 

•  Problem identification rules identify the sources or underlying causes of performance and
reliability non-compliance. These rules examine "causally factored" evaluation results, which
indicate the contributions to delay and failure likelihood associated with specific system design
elements. Elements which are found to be significant contributors to non-compliance are
asserted to be performance and/or reliability problems.

•  Design change recommendation rules identify system design changes which will correct
identified performance and reliability problems, and thus move a system in the direction of
compliance with all of its requirements. The user can investigate the quantitative impact of
recommended design changes by re-specifying appropriate system design constructs or
parameters, and then re-executing the models.

Modification and extension of the rule base is accomplished using the "rule builder" interface. This
interface allows rules to be expressed in an "English-like" manner more natural to an expert than the
pattern matching syntax used in CLIPS. Various rule base maintenance utilities are also provided to
assist the expert.

3.  System  Design  Description 

START/ES  responds  to  the  need  for  improved  accessibility  to  performance  and  reliability  modeling 
techniques,  including  consideration  of  the  interaction  effects  between  performance  and  reliability  [3]. 
Intended users include system designers and analysts requiring performance and reliability design
verification of mission-critical computer systems. These users are concerned with exploration of a
broad architectural design space and verification of design alternatives. For this intended usage the
architectural variant [4] level of design description is used in START/ES to expedite the analysis of
performance  and  reliability. 

The  system  description  component  of  START/ES  captures  elements  of  a  system  design  relevant  to 
evaluation  of  performance  and  reliability  characteristics.  A  graphical  interface  allows  the  user  to  specify 
the  attributes  and  organization  of  the  functional,  software,  and  hardware  components  of  the  system. 

System  Functionality  and  Requirements 

The basic functionality of a system is represented in control flow diagrams called stimulus control
flows (SCFs); these indicate sequences of primitive functions which are triggered by arrival of an
internal or external stimulus, and which typically result in one or more corresponding outputs.
Primitive functions are connected by directional flow connectors which indicate the amount of data that
is exchanged between functions. When multiple flow connectors emanate from a single function, path
probabilities are assigned to these connectors to indicate the relative likelihood that each path is taken
by an arriving stimulus.

System performance and reliability are measured relative to sequences of primitive functions delineated
within SCFs by stimulus-response markers. The serial set of processing activities within the scope of a
particular stimulus-response marker is referred to as a response thread. Figure 2 illustrates the
graphical user interface that is used to define SCFs and stimulus-response markers.

System performance and reliability requirements are specified for response threads and for system
resources (hardware devices and tasks). A response time requirement and an availability requirement
are  specified  for  each  response  thread.  Utilization  limits  are  specified  for  each  hardware  device  and 
task  in  the  system  description. 

Software  and  Hardware  Components 

Each  primitive  function  within  an  SCF  is  attached  to  a  software  module  which  indicates  the  types  and 
quantities  of  computing  services  needed  to  implement  the  function.  Software  modules  are  grouped  into 
dispatchable units of software called tasks. The software functionality of the system is mapped onto
hardware by allocating tasks to specific hardware devices appearing in a hardware architecture
diagram.


Figure 2: Stimulus Control Flow


The hardware architecture diagram identifies all of the processors, communication devices, and storage
devices  used  in  the  system,  and  indicates  how  these  are  interconnected.  Figure  3  illustrates  the 
graphical  interface  that  is  used  to  define  hardware  architectures. 

Table  1  lists  basic  performance  and  reliability  parameters  that  are  specified  for  each  hardware  device  in 
the  hardware  architecture  diagram. 

In  addition  to  device  processing  rates,  system  performance  is  affected  by  system  service  overhead  rates 
for  intertask  communication,  intercomputer  communication,  and  data  access;  these  are  represented  as 
attributes of operating systems specified for each processor, communication protocols specified for
each  communication  device,  and  processor  overheads  specified  for  each  storage  device. 

The  hardware  redundancy  parameters  specified  for  each  device  define  a  hardware  subsystem 
consisting of a set of identical units which are designated either active or backup. It is assumed that the
total  processing  load  on  the  subsystem  is  shared  by  all  active  units,  and  that  backup  units  are  held  in 
reserve  to  replace  active  units  that  fail.  In  each  subsystem,  detection  latency  represents  the  mean  time  to 
detect  the  failure  of  a  unit,  while  recovery  time  is  the  mean  time  to  bring  an  inactive  unit  into  the  active 
configuration. 

The  transient  error  rate  specified  for  each  device  indicates  the  rate  at  which  faults  occur  during  use  of 
the  device  which  cause  incorrect  outputs  to  be  produced,  but  which  do  not  require  physical  repair 
actions  to  be  performed.  Subsystem  error  recovery  options  include  temporal  repetition,  in  which 
operations are repeated "temporally" on each active unit until two matching outputs are produced, and
N-modular redundancy (NMR), in which operations are replicated in parallel on each active unit, and a
voting mechanism is used to detect and mask erroneous outputs.
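
The two error recovery options can be sketched in Python as follows. This is an illustrative sketch only: the strict-majority voting rule and the `max_attempts` bound are assumptions, since the text does not specify how START/ES models the voter or bounds the number of repetitions.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence


def nmr_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the strict-majority output of the replicated units.

    Returns None when no output has a strict majority, i.e. an error is
    detected but cannot be masked (assumed voting rule).
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


def temporal_repetition(operation: Callable[[], Hashable],
                        max_attempts: int = 10) -> Optional[Hashable]:
    """Repeat an operation until two attempts produce matching outputs.

    max_attempts is a hypothetical safety bound, not a START/ES parameter.
    """
    seen: Counter = Counter()
    for _ in range(max_attempts):
        result = operation()
        seen[result] += 1
        if seen[result] == 2:
            return result
    return None
```

With three active units, `nmr_vote` masks a single erroneous output; with two, a disagreement can be detected but not resolved, which is why the function returns `None` in that case.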




Figure  3:  Hardware  Architecture  Diagram 


Table 1: Performance and Reliability Parameters

Processing rate                 Processors: speed (in MIPS), multiplicity
                                Communication devices: transfer rate, multiplicity
                                Storage devices: transfer rate, access latency, multiplicity

Reliability                     Mean time to failure (MTTF), transient error rate

Maintainability                 Mean time to restore (MTTR)

Hardware redundancy             Number of active units, number of backup units
                                Detection latency, recovery time

Error detection and correction  Temporal repetition (Yes/No)
                                N-modular redundancy (Yes/No)

Hardware  subsystems  can  be  used  to  represent  many  different  configurations,  each  of  which  provides 
some  degree  of  standby  redundancy,  extra  processing  capacity,  or  error  detection  and  correction 
capability. Table 2 indicates parameters that would be specified for some commonly used
configurations. 

START/ES also provides a means to account for the unreliability of software. A mean executions
between  failure  (MEBF)  parameter  may  be  specified  for  each  software  module,  indicating  the  mean 
number  of  invocations  of  that  module  between  functional  failures  caused  by  a  software  design  fault 
[5].

Table 2: Hardware Subsystem Configurations

Configuration                    Hardware Subsystem Specification

Simplex system (no redundancy)   Number active = 1, Number backup = 0
Duplex system                    Number active = 2, Number backup = 0, NMR selected
Triple modular redundant system  Number active = 3, Number backup = 0, NMR selected
Hybrid redundant system          Number active >= 2, Number backup >= 1, NMR selected
Loadsharing system               Number active >= 2, Number backup = 0
Standby redundant system         Number active >= 1, Number backup >= 1
Master/slave system              Number active = 1, Number backup = 1

4.  System  Design  Evaluation 

START/ES  automatically  translates  system  design  descriptions  into  mathematical  models,  then 
executes these models to obtain measures of system performance and reliability. Behavioral
interrelationships  between  performance  and  reliability  —  such  as  service  degradation  due  to  permanent 
hardware  failure  ("performability")  and  utilization  of  processing  resources  by  error  recovery 
mechanisms  —  are  accounted  for  through  parametric  interchange  between  the  models  of  each  type.  The 
primary  performance  and  reliability  measures  that  are  computed  are  as  follows: 

•  Thread  service  time  -  the  total  elapsed  execution  time  during  realization  of  a  response  thread. 
This  result  represents  the  minimum  achievable  response  time,  given  the  service  demands  within 
the  thread  and  the  inherent  service  rates  of  the  hardware  devices  used  in  fulfilling  those 
demands. System overheads associated with intertask communication, communication protocol
processing,  transient  error  recovery,  and  other  services  are  applied  as  appropriate  in  calculating 
thread  service  times. 

•  Device utilization - the total (offered) load placed on a hardware device. For each hardware
device, this result is defined as the total of all service demands (per unit time) on the device
divided  by  the  device's  service  rate,  and  is  expressed  as  a  percentage.  Values  greater  than  100% 
indicate  an  unstable  situation  in  which  a  device  has  insufficient  capacity  to  handle  the  load  placed 
on  it;  in  the  long  run,  queues  for  such  a  device  will  grow  without  bound. 

•  Thread  availability  -  the  probability  that  a  thread  is  completed  successfully.  Successful 
completion  requires  that  operational  hardware  devices  are  available  to  accept  all  constituent 
service  demands,  and  that  no  uncorrected  transient  errors  or  software  failures  occur  during 
realization  of  the  thread. 
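
The device utilization definition above can be written out as a short Python sketch. The representation of a service demand as an (arrival rate, work per arrival) pair and the example figures are assumptions made for illustration; START/ES derives these quantities from the system description.

```python
from typing import Iterable, Tuple


def device_utilization(demands: Iterable[Tuple[float, float]],
                       service_rate: float) -> float:
    """Offered load on a device, as a percentage of its service rate.

    demands: (arrivals per second, work units per arrival) pairs.
    service_rate: work units the device completes per second
                  (e.g. millions of instructions per second for a processor).
    """
    offered = sum(rate * work for rate, work in demands)
    return 100.0 * offered / service_rate


def is_unstable(utilization_percent: float) -> bool:
    """Above 100%, queues for the device grow without bound in the long run."""
    return utilization_percent > 100.0


# Hypothetical 2.0 MIPS processor serving two flows.
cpu_util = device_utilization([(10.0, 0.05), (4.0, 0.25)], service_rate=2.0)
```

In this example the offered load is 1.5 million instructions per second against a 2.0 MIPS processor, so the device is stable but has limited headroom.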

Additional  performance  and  reliability  results  that  are  calculated,  and  which  can  be  displayed  at  the 
user's  option,  include: 

•  Function  service  time  -  the  total  elapsed  service  time  during  completion  of  all  service  demands 
within  a  single  primitive  function  in  an  SCF  diagram. 

•  Device  availability  -  the  probability  (at  an  arbitrary  instant  of  time)  that  a  particular  device  is  in  an 
operational  state.  When  a  device  consists  of  a  multi-unit  subsystem,  this  availability  result 
represents  the  probability  that  at  least  one  of  the  units  within  the  redundant  configuration  is 
available to accept arriving service demands.

Figure 4 shows an SCF diagram on which response thread service time, response thread availability,
and function service time results are displayed. Figure 5 shows a hardware architecture diagram in
which each device is annotated with device utilization and device availability results.


Figure 4: Thread Service Time and Availability Results


Figure 5: Device Utilization and Availability Results


Causally Factored Performance and Reliability Results

The  automated  reasoning  component  of  START/ES  requires  not  only  top-level  measures  of 
performance  and  reliability  (e.g.,  for  assessing  compliance  with  requirements),  but  also  lower-level 
measures  which  indicate  the  contributions  of  individual  design  components  and  attributes  to  response 
thread  delays  and  failure  likelihoods.  Such  measures  are  needed  during  the  problem  identification 
reasoning  process,  which  attempts  to  isolate  the  root  causes  of  performance  and  reliability  shortfalls. 

In  order  to  obtain  this  information,  causally  factored  performance  and  reliability  data  is  collected  during 
execution  of  the  design  evaluation  algorithms. 

Factored performance results. During performance computations, START/ES traces through control
flow diagrams in order to accumulate the total service time within the scope of each response thread,
and the total utilization of each hardware device. These accumulations are based on "atomic" service
time events that are implied by individual service demands occurring within the control flows. For each
atomic service time event encountered, a data record containing various numerical values, as well as
qualitative information regarding the context within which the service time event occurred, is created
and stored. When performance computations are complete, the set of all of these records serves as a
data base of factored performance results, from which the contributions to thread service times and
device utilizations associated with various design elements can be reconstructed during execution of the
automated reasoning component.

Specific context elements attached to service time values include the dispatchable unit of software
(task) within which the service time event occurred, and the type of application or system overhead
operation being performed (e.g., instruction execution, intertask communication, intercomputer
communication, remote data access, local data access). Due to the emphasis of START/ES on
reasoning about performance/reliability tradeoffs, it is essential that processing delays directly related
to the use of fault tolerance capabilities be identified; for example, the context element "transient error
recovery" is attached to service times incurred during replicated operations associated with error
detection and correction.

Figure 6 illustrates the performance data collection process. The information recorded for each service
time event includes the service time value itself, the applicable flow arrival rate, the device on which
the event occurred, and the attached context information. Note that the elements of context attached to
individual service time values may be embedded within one another; for example, a remote data access
(a "GET" operation) which includes intercomputer communication (ICC) between two processors will
have both the "GET" and "ICC" context elements attached.

The context-based recording scheme used for factored data collection provides a great deal of flexibility
for generation of the basic facts upon which the expert system operates. New elements of context are
easily accommodated, and since the context attached to each service time value is of arbitrary length,
contextual information can be recorded to any level of depth.
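
The recording scheme described above can be illustrated with a minimal Python sketch. The record fields follow the items listed in the text (service time, flow arrival rate, device, context); the `thread` field, the `task:` prefix, and all numerical values are assumptions added so that per-thread contributions can be reconstructed in the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class ServiceTimeEvent:
    thread: str                 # owning response thread (assumed field)
    device: str                 # device on which the event occurred
    service_time: float         # seconds per occurrence
    arrival_rate: float         # applicable flow arrival rate, per second
    context: Tuple[str, ...]    # context elements, of arbitrary length


def service_time_by_context(events: Iterable[ServiceTimeEvent],
                            thread: str) -> Dict[str, float]:
    """Sum the service time attributed to each context element of a thread.

    Context elements can be embedded within one another, so an event tagged
    with both "GET" and "ICC" counts toward both totals; the totals overlap
    and are not a partition of the thread service time.
    """
    totals: Dict[str, float] = defaultdict(float)
    for event in events:
        if event.thread == thread:
            for element in event.context:
                totals[element] += event.service_time
    return dict(totals)


# Hypothetical records for one response thread.
events = [
    ServiceTimeEvent("telemetry", "CPU A", 0.004, 20.0,
                     ("task:ingest", "instruction execution")),
    ServiceTimeEvent("telemetry", "LAN", 0.002, 20.0,
                     ("task:ingest", "GET", "ICC")),
    ServiceTimeEvent("telemetry", "Disk", 0.006, 20.0,
                     ("task:store", "GET")),
]
totals = service_time_by_context(events, "telemetry")
```

Because the context is stored as a tuple of arbitrary length, a new context element requires no change to the record layout or to the aggregation function.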

Factored  reliability  results.  Thread  availability  for  a  particular  response  thread  is  defined  as  the 
probability  that  all  constituent  service  demands  within  the  thread  are  successful,  which  in  turn  requires 
that all of the following are true: (1) all service demands are successfully accepted on an operational
unit  within  a  hardware  subsystem,  (2)  no  uncorrected  transient  errors  occur  during  execution  on  any 
operational  unit,  and  (3)  no  software  failures  occur  during  execution  of  any  software  module  invoked 
by  the  thread. 

During thread availability calculations, the following are recorded for each response thread: (1) the set
of hardware subsystems required, (2) the probability of one or more transient errors occurring during
execution on each subsystem unit actually used, and (3) the probability of one or more software
failures occurring during execution of each software module invoked. This information, together with
availability results calculated for each hardware subsystem, allows overall thread availability to be
calculated. At the same time, thread availability results are factored according to the relative
contributions of the basic sources of functional failure, or "failure classes": hardware unavailability in
hardware subsystems, uncorrected transient errors on active units, and software failures in software
modules.


Figure 6: Context-Based Data Collection


The unavailability contribution of each failure class to the total unavailability of a given response thread
is defined as the total amount of unavailability in the thread which could be reduced by decreasing
occurrences of failures of that type, i.e. as the difference between the availability of the thread given
that failures of that type do not occur, and the availability of the thread as calculated. These
unavailability contributions are normalized so that the set of contribution factors assigned to the various
failure classes sum to unity:


\[
\text{Contribution factor (Class } i\text{)} =
\frac{\text{Unavailability contribution (Class } i\text{)}}
     {\sum_{j} \text{Unavailability contribution (Class } j\text{)}}
\]
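
The normalization can be sketched in Python under one simplifying assumption that is not stated in the text: thread availability is taken to be the product of independent per-class success probabilities, so that removing the failures of one class amounts to dropping its factor from the product. The class names and probabilities are illustrative.

```python
from math import prod
from typing import Dict


def contribution_factors(success_prob: Dict[str, float]) -> Dict[str, float]:
    """Normalized unavailability contribution of each failure class.

    success_prob maps a failure class to the probability that no failure of
    that class occurs during the thread (independence is assumed).
    """
    availability = prod(success_prob.values())
    # Availability if failures of one class never occurred, minus the
    # availability as calculated.
    contributions = {
        cls: prod(p for other, p in success_prob.items() if other != cls)
        - availability
        for cls in success_prob
    }
    total = sum(contributions.values())
    if total == 0:
        return {cls: 0.0 for cls in success_prob}
    return {cls: value / total for cls, value in contributions.items()}


# Hypothetical per-class success probabilities for one response thread.
factors = contribution_factors({
    "hardware unavailability": 0.99,
    "uncorrected transient errors": 0.999,
    "software failures": 0.995,
})
```

In this example hardware unavailability receives the largest factor, since it is the class whose removal would raise thread availability the most.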


Hardware  unavailability  results  are  further  factored  to  quantify  the  contributions  of  three  basic  causes 
of  hardware  unavailability.  Maintenance  downtime  is  unavailability  due  to  all  units  in  a  device 
subsystem  being  in  repair.  Reconfiguration  downtime  is  unavailability  due  to  switchover  of  backup 
units  into  active  service.  Detection  latency  is  unavailability  caused  by  attempts  to  assign  service 
demands  to  a  unit  which  has  failed  but  whose  failure  has  not  yet  been  detected  by  the  system.  The 
contribution  factors  assigned  to  each  of  these  failure  subclasses,  as  a  proportion  of  the  contribution 
factor  assigned  to  hardware  unavailability  for  a  particular  device  subsystem  as  a  whole,  are  determined 
from  the  steady-state  solution  to  the  mathematical  availability  model  upon  which  subsystem  availability 
results are based:

\[
\text{Contribution factor (Subclass } k\text{)} =
\frac{\sum_{j \in S_k} P_j}{1 - A_s}
\times \text{Contribution factor (Hardware unavailability)}
\]

where $P_j$ = steady-state probability associated with subsystem state $j$,
$S_k$ = set of subsystem states corresponding to type $k$ unavailability,
and $A_s$ = subsystem availability.
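
As a worked illustration of this formula, the following Python sketch applies it to a hypothetical four-state subsystem. The state names, steady-state probabilities, and the hardware unavailability contribution factor of 0.6 are assumed values; in START/ES they would come from the solved availability model.

```python
from typing import Dict, Set


def subclass_factors(state_prob: Dict[str, float],
                     subclass_states: Dict[str, Set[str]],
                     availability: float,
                     hardware_factor: float) -> Dict[str, float]:
    """Split the hardware unavailability factor among its subclasses.

    state_prob: steady-state probability P_j of each subsystem state j.
    subclass_states: the state set S_k for each unavailability subclass k.
    availability: subsystem availability A_s.
    hardware_factor: contribution factor assigned to hardware unavailability.
    """
    unavailability = 1.0 - availability
    if unavailability <= 0.0:
        return {k: 0.0 for k in subclass_states}
    return {
        k: sum(state_prob[j] for j in states) / unavailability * hardware_factor
        for k, states in subclass_states.items()
    }


# Hypothetical steady-state solution for one device subsystem.
state_prob = {"up": 0.97, "repair": 0.02, "switchover": 0.006, "undetected": 0.004}
sub = subclass_factors(
    state_prob,
    {
        "maintenance downtime": {"repair"},
        "reconfiguration downtime": {"switchover"},
        "detection latency": {"undetected"},
    },
    availability=0.97,
    hardware_factor=0.6,
)
```

When the subclass state sets together cover every unavailable state, the subclass factors sum to the hardware unavailability factor, as they do here.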


5.  Expert  System 

The  START/ES  automated  reasoning  component  is  implemented  using  the  CLIPS  expert  system  shell. 
System  description  and  design  evaluation  results,  represented  as  CLIPS  facts,  are  queried  during 
execution  of  the  rule  base,  which  is  also  encoded  in  CLIPS. 

The rule base is partitioned into three rule sets. These rule sets are designed to be executed
sequentially, so that the information generated during each reasoning stage can be accumulated and
used during subsequent stages. The results of each stage are also displayed to the user.

Reasoning  Stage  1:  Compliance  Assessment 

Compliance assessment rules compare top-level performance and reliability results for each response
thread to the corresponding user-specified requirement values. A system is asserted to be
non-compliant if at least one response thread in the system is non-compliant. After all defined response
threads have been checked for compliance -- which in a system containing many critical functions may
involve a large number of comparisons -- a list of non-compliant threads is displayed to the user, as
illustrated in Figure 7.


Expert System Results

    Thread Real time telemetry response is noncompliant wrt availability

    Thread Emergency Command Response is noncompliant wrt
    device Database Disk utilization


Figure 7: Compliance Assessment Results


A response thread is considered to be non-compliant with its performance requirements if either (1) the
computed thread service time exceeds the thread response time requirement, or (2) at least one device
or task used by the thread exceeds its resource utilization limit. In the first case, the thread is asserted
to be non-compliant with respect to service time, and in the second case, the thread is asserted to be
non-compliant with respect to utilization of the device or task. The reasoning behind this approach is
that if thread service time exceeds the required response time, then the thread cannot possibly meet its
response time requirement. However, if the service time is less than the response time requirement, the
thread may still fail to meet its requirement -- due to contention effects -- if the utilization of one or
more resources used during execution is high. This approach allows compliance with performance
requirements to be assessed using an analytic modeling approach which avoids explicit analysis of
resource contention effects.

A response thread is considered to be non-compliant with respect to availability if the calculated
availability of the thread -- indicating the probability that all service demands within the thread are
completed successfully -- is less than the thread availability requirement.
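
The compliance checks described in this stage can be summarized in a short Python sketch. In START/ES these checks are encoded as CLIPS rules; the data layout, field names, and numerical values below are hypothetical and serve only to make the three comparisons explicit.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ThreadResult:
    name: str
    service_time: float               # computed thread service time (s)
    response_time_requirement: float  # required response time (s)
    availability: float               # calculated thread availability
    availability_requirement: float
    resources_used: List[str]         # devices and tasks used by the thread


def assess_thread(thread: ThreadResult,
                  utilization: Dict[str, float],
                  utilization_limit: Dict[str, float]) -> List[str]:
    """Return the respects in which a thread is non-compliant (empty if none)."""
    findings = []
    if thread.service_time > thread.response_time_requirement:
        findings.append("service time")
    for resource in thread.resources_used:
        if utilization[resource] > utilization_limit[resource]:
            findings.append(f"utilization of {resource}")
    if thread.availability < thread.availability_requirement:
        findings.append("availability")
    return findings


# Hypothetical evaluation results (percent utilization per resource).
utilization = {"Database Disk": 92.0, "Comm Processor": 40.0}
limits = {"Database Disk": 70.0, "Comm Processor": 70.0}
telemetry = ThreadResult("Real time telemetry response", 0.8, 1.0,
                         0.9990, 0.9995, ["Comm Processor"])
command = ThreadResult("Emergency Command Response", 0.3, 0.5,
                       0.9999, 0.9995, ["Database Disk", "Comm Processor"])
report = {t.name: assess_thread(t, utilization, limits)
          for t in (telemetry, command)}
system_compliant = all(not findings for findings in report.values())
```

A thread can pass the service time check and still be reported, because the utilization check stands in for an explicit analysis of contention effects.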

Reasoning  Stage  2:  Problem  Identification 

For  each  non-compliant  response  thread  identified  during  the  compliance  assessment  reasoning  stage, 
problem  identification  rules  examine  causally  factored  design  evaluation  results  to  determine  the  most 
significant contributors to performance and reliability shortfalls. Because the context information
attached to causally factored results is of arbitrary depth and complexity, the number of different
design  elements  which  must  be  considered  as  potential  contributors  may  be  quite  large.  The  ability  to 
isolate  system  components  and  design  features  that  are  most  critical  in  terms  of  meeting  requirements  -- 
from  among  such  a  large  set  of  potential  elements  —  is  perhaps  the  most  useful  aspect  of  the  problem 
identification  results  provided  by  the  expert  system  to  the  user  (Figure  8). 


Expert System Results

    WRITE (Database Disk) is a device utilization problem for

    Hardware unavailability on device Telemetry and Command Processor
    is an availability problem for thread Real time telemetry response

    Maintenance downtime on device Comm Processor


Figure 8: Problem Identification Results


Given  that  a  particular  response  thread  is  non-compliant  with  respect  to  either  service  time  or  utilization 
of a resource, design elements associated with a large percentage of the amount of non-compliance are
identified  as  service  time  problems  or  as  utilization  problems.  Design  elements  which  may  be  identified 
include  individual  hardware  devices,  fault  tolerance  overheads,  and  system  service  overheads  such  as 
intertask  communication  on  a  processor,  intercomputer  communication  between  two  processors, 
remote  data  access  between  a  processor  and  a  data  file,  and  local  data  access  on  a  storage  device. 

Given that a particular response thread is non-compliant with respect to availability, failure classes
which contribute most significantly to thread unavailability are identified as availability problems. The
significance of failure class contributions is assessed based on the unavailability contribution factors
computed for the thread. Failure classes which may be identified include hardware unavailability,
maintenance downtime, reconfiguration downtime, detection latency, or uncorrected transient errors in
a hardware device, and software failures in a software module.
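
One way to picture this selection step is the following Python sketch. It is an assumption-laden illustration rather than the START/ES rule base: the idea of flagging every contributor whose share of the total exceeds a fixed threshold, the threshold value, and the sample contribution factors are all hypothetical.

```python
def significant_contributors(contributions, share_threshold=0.2):
    """Return (name, share) pairs for contributors whose share of the total
    is at least share_threshold, largest first.

    The threshold-based selection is an assumption of this sketch; the
    source only says that the most significant contributors are identified.
    """
    total = sum(contributions.values())
    if total <= 0:
        return []
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, value / total) for name, value in ranked
            if value / total >= share_threshold]


# Hypothetical unavailability contribution factors for one response thread.
factors = {
    "hardware unavailability (Device A)": 0.0040,
    "maintenance downtime (Device B)": 0.0030,
    "detection latency (Device A)": 0.0005,
    "software failures (Module M)": 0.0025,
}
for name, share in significant_contributors(factors):
    print(f"{name}: {share:.0%} of thread unavailability")
```

The same function could be applied to service time or utilization contributions, since it only needs a mapping from design element to its contribution.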

Reasoning  Stage  3:  Design  Change  Recommendation 

Design change recommendation rules employ expert judgements in attempting to recommend design
changes which will reduce the magnitude of one or more identified problems, and thus move the
system in the direction of compliance with its requirements. A primary goal of the recommendations
produced is to assist the designer in understanding the sensitive tradeoffs that attend various design
choices, including the subtle interactions between performance and reliability behavior. Figure 9
illustrates a set of recommendations produced by the expert system; these recommendations include
design changes in both the performance and reliability areas, and address inherent device
characteristics as well as system-level architectural issues.


[Figure 9: screenshot of the Expert System Results window. The one legible entry recommends possible
restructuring of data in a data store (name illegible) to reduce either the number of accesses or the
volume of data to access.]

Figure 9: Design Change Recommendations


The  design  change  recommendation  rule  base  can  be  divided  conceptually  into  six  areas,  each  of  which 
pertains  to  a  particular  set  of  system  design  issues: 

•  The  hardware  characteristics  design  area  is  concerned  with  inherent  hardware  device 
performance  and  reliability  attributes.  When  performance  problems  have  been  traced  to  a 
particular  device,  rules  in  this  area  suggest  improvements  to  such  characteristics  as  device 
processing  speed  and  multiplicity.  Similarly,  when  reliability  problems  have  been  traced  to  a 
particular  device,  improvements  to  the  basic  device  attributes  Mean  Time  to  Failure  (MTTF)  or 
Mean Time to Restore (MTTR) are recommended.

•  The task structure design area is concerned with packaging of software into tasks, which are the
lowest level of concurrency represented in START/ES. Rules in this area focus on such
performance issues as the degree of parallelism that can be achieved through various partitioning
schemes, versus the amount of intertask communication overhead incurred in each case.

•  The functional and data allocation design area is concerned with the assignment of tasks and data
files to hardware devices. Alternative assignment schemes may affect many different
performance indicators, including thread service times, device utilizations, and intercomputer
and intertask communication overheads. Task and file allocations may also affect reliability, in
that alternative assignments may introduce different sources of unreliability into response
threads which require the use of the devices to which tasks and files are assigned.

•  The hardware subsystem structure design area is concerned with the performance and reliability
implications of alternative hardware redundancy structures and redundancy management
capabilities. Rules in this design area address the selection of subsystem design parameters,
including the levels of active and standby redundancy employed, which will achieve the best
overall balance between subsystem performance and reliability effects.

•  The transient error recovery design area is concerned with the performance and reliability
implications of alternative transient error detection and correction schemes. Rules in this area
attempt to balance the amount of coverage provided for transient errors with the amount of
additional processing time and resource utilization caused by replicated operations.

•  The software reliability design area is concerned with the effects of inherent software module
reliability  on  response  thread  unavailability. 

As  an  example  of  design  change  recommendation  reasoning,  consider  a  hardware  subsystem  which 
has  been  identified  as  a  utilization  problem.  If  this  subsystem  contains  at  least  one  backup  unit,  then 
the  effective  load  on  the  subsystem  can  be  reduced  by  using  one  or  more  of  these  backups  in  active 
mode.  However,  doing  so  may  increase  the  overall  failure  rate  of  the  subsystem,  and  thus  increase 
overall  downtime.  Thus,  provided  that  maintenance  downtime  on  the  subsystem  is  not  already  an 
availability  problem,  it  is  recommended  in  this  case  that  a  backup  unit  be  used  for  active  processing: 

IF    Device D is a utilization problem; and
      Maintenance downtime on Device D is not an availability problem; and
      Number of backup units in Device D subsystem ≥ 1;

THEN  Recommend changing one or more Device D subsystem backup units to active units to
      increase service capacity.

As  another  example  of  a  design  change  recommendation  rule,  consider  a  device  on  which  uncorrected 
transient errors are an availability problem. If there is only one active unit in the device subsystem
(thus  precluding  the  use  of  N-modular  redundancy  with  the  existing  hardware),  then  it  is 
recommended  that  a  temporal  repetition  scheme  be  used  to  detect  and  correct  errors.  However,  since 
using  that  scheme  will  significantly  increase  service  times  on  the  device,  this  recommendation  is  only 
made  if  the  device  is  not  already  a  service  time  problem: 

IF    Uncorrected transient errors on Device D are an availability problem; and
      No error recovery scheme is selected on Device D; and
      Number of active units in Device D subsystem = 1; and
      Device D is not a service time problem;

THEN  Recommend using temporal repetition on Device D to reduce failures caused by
      uncorrected transient errors.
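
The two example rules can be restated as ordinary predicates. The Python sketch below is only an illustration of the rule logic: START/ES expresses its rules in CLIPS, and the DeviceFacts record, its field names, and the problem labels used here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class DeviceFacts:
    """Hypothetical summary of what earlier reasoning stages found for a device."""
    name: str
    problems: Set[str] = field(default_factory=set)  # e.g. {"utilization"}
    backup_units: int = 0
    active_units: int = 1
    error_recovery_scheme: Optional[str] = None


def recommend(device):
    """Apply the two example rules and return the recommendations that fire."""
    recs = []
    # Rule 1: activate backups, unless maintenance downtime is already a problem.
    if ("utilization" in device.problems
            and "maintenance downtime" not in device.problems
            and device.backup_units >= 1):
        recs.append(f"Change one or more {device.name} subsystem backup units "
                    "to active units to increase service capacity.")
    # Rule 2: temporal repetition, unless service time is already a problem.
    if ("uncorrected transient errors" in device.problems
            and device.error_recovery_scheme is None
            and device.active_units == 1
            and "service time" not in device.problems):
        recs.append(f"Use temporal repetition on {device.name} to reduce "
                    "failures caused by uncorrected transient errors.")
    return recs


disk = DeviceFacts("Device D", problems={"utilization"}, backup_units=1)
print(recommend(disk))
```

Each rule guards its recommendation with a negative condition on the opposite concern, which is how the tradeoff between performance and reliability shows up in the rule base.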

6. Rule Builder Interface

Recent experience with a rule-based expert system tool for performance analysis [6, 7] indicated that
the utility of such a tool is greatly enhanced if the capability to specialize, extend, or otherwise modify
the rule base is provided for the expert user. For START/ES, this capability is provided for the design
change recommendation portion of the rule base. Access to these rules is provided by the rule builder
interface.

The rule builder interface is based on an "English-like" rule language that allows rules to be expressed
in a manner more natural to the expert than the pattern matching syntax used in CLIPS. The translation
from rule language forms to internal CLIPS rule representations is handled automatically. Figure 10
illustrates the dialog used to edit rule clauses, in which "pop-up" menus are used to provide access to
alternative selections for various rule components.


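[Figure 10: screenshot of the rule editing dialog. The legible clause reads approximately:
IF "Hardware unavailability" on "hardware device to be called" is "an availability problem" for
"thread to be called", with a "Done" button.]

Figure 10: Rule Editing Dialog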


The  rule  builder  interface  is  supported  by  various  utilities  designed  to  assist  the  expert  user.  These 
include  the  ability  to  search  the  rule  base  for  specified  character  strings,  the  ability  to  create,  load,  and 
save  alternative  rule  bases,  and  the  ability  to  stop  and  resume  expert  system  execution  at  the  point  at 
which  a  specified  rule  fires. 

As  an  example  of  the  way  in  which  the  expert  reasoning  capability  may  be  extended,  consider  a 
situation  in  which  several  response  threads  do  not  meet  their  availability  requirements,  and  in  which 
several different devices have been identified as availability problems. Currently, the design change
recommendation rule base will recommend that the inherent reliability of all of these devices be
increased. A natural extension would be to identify devices whose unreliability is particularly
problematic, and which are therefore prime candidates for improvement. For example, a rule could be
formulated which would find devices that are availability problems for more than one response thread,
and recommend improving the reliability of these devices first, before considering those which affect
only  a  single  thread. 
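
The proposed extension amounts to grouping availability problems by device and counting the distinct threads each device affects. The Python sketch below illustrates that logic only; it is not rule builder syntax, and the function name and the "Command uplink" thread are hypothetical.

```python
from collections import defaultdict


def prioritize_devices(availability_problems):
    """Split devices into those affecting several threads and those affecting one.

    availability_problems is an iterable of (device, thread) pairs, one per
    identified availability problem. Returns (improve_first, improve_later).
    """
    threads_by_device = defaultdict(set)
    for device, thread in availability_problems:
        threads_by_device[device].add(thread)
    improve_first = sorted(d for d, t in threads_by_device.items() if len(t) > 1)
    improve_later = sorted(d for d, t in threads_by_device.items() if len(t) == 1)
    return improve_first, improve_later


# Device and thread names follow Figure 8; "Command uplink" is made up.
problems = [
    ("Comm Processor", "Real time telemetry response"),
    ("Comm Processor", "Command uplink"),
    ("Telemetry and Command Processor", "Real time telemetry response"),
]
first, later = prioritize_devices(problems)
print("Improve first:", first)
print("Then consider:", later)
```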

Conclusions 

START/ES  provides  automated  interpretation  of  performance  and  reliability  model  results.  Using 
expert  system  technology,  it  identifies  problems  and  recommends  design  changes.  Initial  evaluation 
with  small  scale  system  designs  has  shown  the  potential  value  of  the  expert  system  based  approach. 
START/ES has been able to process causally factored model results and isolate the causes of
non-compliance with performance and reliability requirements. Its recommendations, when interpreted by
the user in the context of the current system design, provide valid guidance in converging on a fully
compliant design.

At the same time, early use has revealed the need for further research and enhancements. In the area of
system description, an ability to define specialized system design elements and their performance and
reliability behaviors would be a valuable addition. In the area of performance models, a simulation
capability is needed to evaluate contention effects more robustly than the current analytic model. In the
area of the expert system, the rule builder needs to be extended to access all system description
elements and model results. The rule language needs to be more comprehensive in terms of its logical
expressiveness. Finally, START/ES needs to be evaluated on real-world problems of scale. These
issues represent future work related to START/ES and, more generally, expert system applications to
performance and reliability modeling.


Abstract

An overview of a formal approach is presented for mapping a specification of a
real-time system onto a design space of abstract “locations,” consisting of hardware,
software, communication components and human interfaces. The specification is assumed
to be given in terms of a “hierarchical multi-state (HMS) machine,” which
is obtained by integrating an advanced type of state model with an interval-based
temporal logic. Formal verification techniques and correctness-preserving or partially
correctness-preserving transformation methods are two of the promising approaches for
maintaining properties of a specification in the transition to design. The location concept
also provides the means for deriving data exchanges and performance requirements
on the design components of a system as progress is achieved towards implementation.

Keywords: Real-time systems, specification, design structuring, allocation of requirements,
system modeling.


1  Introduction 

While numerous specification and design methods for real-time systems have been investigated
in the recent past, little progress has been made in formalizing the transition from
specification to design in a way that guarantees the preservation of temporal properties.
Standard design methodologies for real-time systems (e.g., [HP87, WM86]) depend entirely
on informal methods, usually based on finite-state machine extensions of data flow diagrams.
Since finite-state machines easily become intractable for even relatively simple systems, such
representations are usually confined to the definition of simple local control conditions, with
no hope of offering even a simulation-based analysis of behavior. Other methods such as
DCDS [Al85] have addressed the simulation aspects, while various formal methods for
specification and verification of real-time systems have been proposed in the literature (see, e.g.,
[Ga91a] and accompanying articles). Executable formal specification methods usually also
provide simulation capabilities. However, it is well known that simulation is simply
inadequate for guaranteeing correctness.

•This work was supported in part by the Office of Naval Research under contract N0{)O14-92-C-OO'17.

The key stages in the transition from specification to design are (1) the structuring or
partitioning  of  the  specification  space,  (2)  the  definition  of  a  design  space,  (3)  the  definition 
of  a  mapping  from  the  specification  space  to  the  design  space,  and  (4)  the  formal  derivation 
of  design  requirements  from  the  specification  requirements.  Success  in  the  development  of 
a  formal  basis  for  these  four  stages  depends  on  the  ability  to  define  precise  mathematical 
formulations  of  the  specification  and  design  spaces  and  the  mappings  between  them. 

The purpose of this paper is to present some preliminary ideas on the use of the “hierarchical
multi-state (HMS) machines” [GF88, GI90, GF91, GI90, Ga91a] in formalizing the
transition from specification to design. HMS machines are obtained by integrating parallel
and hierarchical automata with a temporal interval logic, called TIL, to provide a formal
methodology for specifying the behavior and requirements of hard real-time systems. Several
independent formal methods for verifying “safety properties” of HMS machine specifications
have been developed so far. The correctness-preserving transformations of [FG89] and the
model-based theorem proving of [GI91, Ga91b] provide refutation-based verification
capabilities that, in general, avoid the need for complete enumeration of behavior. The model
checking of [GI92] and the interacting computation graphs of [GJ91] offer manageable
approaches to enumerative verification.

To  deal  with  the  transition  to  design,  we  propose  the  partitioning  of  the  state  space  of 
an  HMS  machine  representing  the  behavior  of  a  system  into  a  set  of  “locations.”  A  location 
is  an  abstraction  of  a  piece  of  hardware,  a  software  program,  a  communication  medium  or  a 
human  interface.  Since  an  implementation  creates  its  own  requirements  beyond  the  system 
requirements,  a  process  of  refinement  is  usually  necessary  that  expands  the  specification 
and  creates  an  extension  of  the  original  HMS  machine.  In  addition,  two  other  elements 
of  an  HMS  machine  must  be  mapped  into  the  space  of  locations:  (1)  “transitions”  that 
define  what  changes  in  states  can  occur  in  an  HMS  machine,  and  (2)  “controls”  consisting 
of TIL predicates that define constraints on transitions. Data interchanges and temporal
requirements on the design space are derived essentially automatically once the
space of locations is defined. The key problem that remains is to verify that the refinement
satisfies the requirements. This can be accomplished in several ways. One approach is to
employ transformations such as those in [FG89, GI92] that preserve or partially preserve
behavior. Another method is to verify formally that key safety properties are preserved
in the refined machine using one of the available verification methods for HMS machines.
A third approach not considered here, which is commonly used in the context of process
algebra, is bisimulation, in which one attempts to prove that the two specifications have
identical behaviors.

In Section 2 of this paper we present an overview of the basic concepts of HMS machines,
and in Section 3 we employ the specification of a simple railroad operation to give an outline
of our approach for transitioning a specification to a design. In the process, we also present
our visual notation for representing HMS machines. In Section 4 we offer a brief summary
and the conclusions.


2  Background  and  Definitions 

We begin by defining the non-hierarchical version of a specification formalism that integrates
a parallel version of automata with an interval-based temporal logic. Formally, we define a
discrete-time, boolean “multi-state (MS) machine” as a triple $H = (S, \Gamma_D, \Gamma_N)$, where

1.  $S$ is a set of “states,” any number of which may be true or “marked” at a moment of
time.

2.  $\Gamma_D$ and $\Gamma_N$ are, respectively, the sets of “deterministic” and “nondeterministic”
transitions of $H$. Each deterministic or nondeterministic transition is of the form

(PRIMARIES) (CONTROL) $\rightarrow$ (CONSEQUENTS),

where PRIMARIES $\subseteq S$, CONSEQUENTS $\subseteq S$ and CONTROL is a predicate on the
history of the states, expressed in a temporal interval logic called TIL. For a transition
$u$, each state in the associated PRIMARIES (CONSEQUENTS) set of states will be
called a “primary” (“consequent”) state of $u$. Also, the predicate CONTROL for $u$
will be called the “control” or “control predicate” of $u$.

3.  A  transition  is  “enabled”  if  its  primary  states  are  all  true  and  also  its  associated  control 
predicate  is  true. 

4.  At  each  discrete  moment  of  time  all  the  enabled  deterministic  transitions  and  a  subset 
of  the  nondeterministic  transitions  “fire,”  causing  changes  in  the  marking  of  the  states. 
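
Items 1 through 4 can be mirrored by a small data structure. The Python sketch below is an illustration under stated assumptions, not the authors' tooling: the control predicate is modeled as an arbitrary callable on the list of past markings rather than as a TIL formula, only enabled nondeterministic transitions are allowed to fire, and the names Transition, enabled, firing_set and choose are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class Transition:
    name: str
    primaries: FrozenSet[str]
    consequents: FrozenSet[str]
    # Predicate on the history of markings (stand-in for a TIL control).
    control: Callable[[List[FrozenSet[str]]], bool]
    deterministic: bool = True


def enabled(tr, history):
    """Enabled: all primary states are marked now and the control holds."""
    current = history[-1]
    return tr.primaries <= current and tr.control(history)


def firing_set(transitions, history, choose=lambda candidates: candidates):
    """All enabled deterministic transitions fire; `choose` selects a subset
    of the enabled nondeterministic ones (by default, all of them)."""
    det = [t for t in transitions if t.deterministic and enabled(t, history)]
    nondet = [t for t in transitions if not t.deterministic and enabled(t, history)]
    return det + list(choose(nondet))


always_true = lambda history: True
t_det = Transition("raise", frozenset({"Gate Down"}), frozenset({"Gate Up"}), always_true)
t_non = Transition("arrive", frozenset(), frozenset({"Train on Track"}), always_true,
                   deterministic=False)
history = [frozenset({"Gate Down"})]
print([t.name for t in firing_set([t_det, t_non], history)])
```

Passing a different `choose` function, for example one that returns an empty list, models the case in which no nondeterministic transition fires at that moment.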

The set of hierarchical MS machines or “HMS machines” is the superset of the set of
MS machines when some of the states are replaced by HMS machines. The details will not
be presented in this paper. For definitions of recursive hierarchies and the use of different
granularities of time at different levels of hierarchy, see [Ga91a, GI92]. Extensions involving
non-boolean states that accommodate data flows can be found in [GI90].

The behavior of a real-time system can be specified in terms of an HMS machine by
representing its attributes as hierarchical states, with the control predicates defining the
logical and temporal constraints under which changes in the system occur. We note that in
an HMS machine many states can be true at a moment of time. In general, this results in a
significant reduction in the number of states compared to traditional finite-state machines.

We now present a notation and formal definitions for the temporal interval logic TIL and
the state updating rule for HMS machines.

Notation Given an HMS machine $H$ with state set $S$, the “marking” of $H$ at time $t$ is a
mapping $M_t : S \rightarrow \{F, T\}$ that defines the set of marked or true states of $H$.

Definition 1 Given a marking $M_t$ of an HMS machine at time $t$ and a formula $\psi$, we
denote the satisfiability of $\psi$ in $M_t$ by $M_t \models \psi$. The temporal interval logic TIL is then
obtained by extending propositional logic with the following four operators:

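$O(t')$: At relative time $t'$

$$M_t \models O(t')\psi \iff M_{t+t'} \models \psi$$

$[t_1, t_2]$: Always between $t_1$ and $t_2$

$$M_t \models [t_1, t_2]\psi \iff \bigl(t_1 \le t' \le t_2 \text{ implies } M_t \models O(t')\psi\bigr)$$

$\langle t_1, t_2\rangle$: Sometime between $t_1$ and $t_2$

$$M_t \models \langle t_1, t_2\rangle\psi \iff \exists t' \text{ such that } t_1 \le t' \le t_2 \wedge M_t \models O(t')\psi$$

$\langle t_1, t_2\rangle!$: Sometime-change between $t_1$ and $t_2$

$$M_t \models \langle t_1, t_2\rangle!\,\psi \iff \exists t' \text{ such that } (t_1 - 1 \le t' < t_2) \wedge (M_t \models O(t')\neg\psi) \wedge (M_t \models \langle t'+1, t_2\rangle\psi)$$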
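
The four operators can be evaluated directly over a recorded history of markings. The Python sketch below is an illustration under assumptions that are not taken from the paper: a history is a dictionary from absolute time to the set of marked states, times with no entry are treated as having no marked states, interval bounds are read as inclusive, and the sometime-change operator searches $t_1 - 1 \le t' < t_2$. All function names are hypothetical.

```python
def state(name):
    """Atomic formula: state `name` is marked at the evaluation time."""
    return lambda hist, t: name in hist.get(t, frozenset())


def neg(psi):
    return lambda hist, t: not psi(hist, t)


def at(dt, psi):
    """O(dt) psi: psi holds at relative time dt."""
    return lambda hist, t: psi(hist, t + dt)


def always(t1, t2, psi):
    """[t1, t2] psi: psi holds at every relative time in t1..t2."""
    return lambda hist, t: all(psi(hist, t + dt) for dt in range(t1, t2 + 1))


def sometime(t1, t2, psi):
    """<t1, t2> psi: psi holds at some relative time in t1..t2."""
    return lambda hist, t: any(psi(hist, t + dt) for dt in range(t1, t2 + 1))


def sometime_change(t1, t2, psi):
    """<t1, t2>! psi: psi is false at some t' and then true within t'+1..t2."""
    return lambda hist, t: any(
        not psi(hist, t + dt) and sometime(dt + 1, t2, psi)(hist, t)
        for dt in range(t1 - 1, t2))


# Hypothetical history: "Train on Track" becomes true at time 5 and stays true.
train = state("Train on Track")
hist = {t: ({"Train on Track"} if t >= 5 else set()) for t in range(40)}

print(sometime_change(-10, -10, train)(hist, 15))  # became true 10 units ago
print(always(-30, -1, neg(train))(hist, 5))        # false for the last 30 units
```

Negative relative times look into the past, which is how controls such as "became true 10 units of time ago" are expressed.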

Definition 2 For each state $s$ in an HMS machine, let $\Gamma_{in}(s)$ ($\Gamma_{out}(s)$) be the set of
transitions into (out of) $s$. Then, the marking of $s$ at the next moment ($t = 1$), given the marking
at the current moment ($t = 0$), is defined as follows:

$$O(1)s \iff \Bigl(s \wedge \bigwedge_{u \in \Gamma_{out}(s)} O(1)\neg u\Bigr) \vee \Bigl(\bigvee_{v \in \Gamma_{in}(s)} O(1)v\Bigr),$$

where for a function $\psi$, $\bigwedge_{x \in X} \psi(x) = T$ and $\bigvee_{x \in X} \psi(x) = F$ if $X = \{\}$.

Intuitively, a state $s$ is true at time $t = 1$ if and only if (1) $s$ is true at time $t = 0$ and no
transitions fire out of it at $t = 1$, and/or (2) some transition fires into $s$ at time $t = 1$.

3  Transition  of  a  Specification  to  Design 

In  this  section,  we  present  an  example  of  an  HMS  machine  specification  of  a  simple  railroad 
operation  and  we  provide  an  outline  of  a  partial  transition  of  some  of  its  components  to  a 
design.  The  process  requires  two  steps.  In  the  first  step,  two  sections  of  this  specific  HMS 
machine  are  mapped  to  a  location  space  consisting  of  two  components:  (1)  a  software  process 
that  monitors  a  clock  and  sends  a  green  light  signal  for  a  new  train  to  start  on  the  track,  and 
(2)  a  mechanical  device  that  operates  a  gate  mechanism.  In  the  second  step,  the  section  of 
the  HMS  machine  specification  for  each  location  is  refined  to  reflect  design  considerations. 
We  note  that  the  two  steps  can  be  reversed,  i.e.,  it  is  possible  to  perform  the  refinement  first 
and  then  to  define  the  mapping  to  the  location  space.  In  fact,  more  choices  for  design  are 
possible  in  the  latter  case. 

Figure 1 presents our graphic notation for an HMS machine representing the operation
of a simple railroad. In our notation, boxes represent states, dark arrows denote transitions,
with an asterisk indicating that the transition is nondeterministic, thin arrows represent
controls, and temporal operators appear next to the symbol (J). VLSI notation is used to
form logical combinations of control predicates; a TIL predicate of the form $\langle t,t\rangle$ is
abbreviated as $t$, and a predicate of the form $\langle t,t\rangle!$ is abbreviated as $t!$. Also, a short thick
line at the beginning (end) of a transition denotes the special state that is always true (false).
Thus, the nondeterministic transition into the state “Train on Track” may fire if that state
has been false continuously for the last 30 units of time. Also, the transition into the state
“Gate Down” will fire if the state “Train on Track” was false and became true 10 units of
time ago.

Figure 1: HMS Machine Specification of a Simple Railroad Operation

In Figure 1, a partial allocation of the states, transitions and controls of the HMS machine
into the two locations $L_1$ and $L_2$ is made, as indicated by the shaded regions. In Figure 2,
$L_1$ and $L_2$ are refined to reflect the transition of two parts of the specification to a simplified
design. As mentioned earlier, we can assume, for example, that $L_1$ is to be implemented in
terms of a software process that monitors a clock and sends a signal to a green light to allow
a new train on the track 30 time units after the previous train departs. As in Figure 1, the
actual arrival is left nondeterministic as indicated by the asterisk next to the transition into
the state “Start Train.” The location $L_2$ may correspond to a mechanical device that starts a
mechanism for lowering the gate 2 time units after the state “Train on Track” becomes true.
The process of lowering the gate takes 7 time units. Five time units after the train passes
the crossing, the gate is raised automatically so that it is no longer in the “Gate Down”
position.

Figure 2: Partial Design of Railroad Operation

It should be noted that in Figures 1 and 2 we employ a discrete-time version of HMS
machines, in which transitions fire at discrete integer-valued moments of time. This is
consistent with the usual finite-state machine modeling approach to representing behavior.
Under this assumption, it is easy to prove that time delays in locations $L_1$ and $L_2$ of Figure
1 are maintained accurately in the respective locations in Figure 2. Thus, the time delay
budget of 10 time units for the gate to be down in Figure 1 is allocated to two separate
delays plus an extra transition in Figure 2. With the use of continuous-time HMS machines
[Ga91b], the extra time for the second transition would not be required, resulting in a simpler
demonstration of faithfulness of the refinement with the original specification. In both the
discrete and continuous cases, the TIL language can be used to formally state the firing
conditions for transitions in a specification and its refinement. Safety properties can also be
independently verified for a design that is derived from a specification.
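
The delay budget arithmetic for the gate can be checked mechanically. The Python sketch below uses the figures quoted in the text (a 10 unit budget, a 2 unit start delay and a 7 unit lowering time); the assumption that the extra transition costs exactly one time unit in the discrete-time model, and the function name, are not taken from the paper.

```python
def allocated_delay(delays, extra_transitions, units_per_transition=1):
    """Total delay of a refined path: explicit delays plus the time consumed
    by extra transitions (assumed here to cost one unit each in discrete time)."""
    return sum(delays) + extra_transitions * units_per_transition


SPEC_BUDGET = 10  # time units until "Gate Down" in the specification

# Refinement: 2 units before lowering starts, 7 units of lowering,
# plus one extra transition between the two delays.
refined = allocated_delay([2, 7], extra_transitions=1)
print(refined == SPEC_BUDGET)
```

Setting `units_per_transition=0` corresponds to the continuous-time reading, in which the extra time for the second transition is not required and the two delays would have to account for the whole budget themselves.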

Transformations on specifications that maintain temporal properties, such as those in
[FG89, GI92], offer another promising approach to transitioning a specification to design
in a way that is guaranteed to maintain the requirements satisfied by the specification. However,
as noted by a number of writers, this is often not necessary. For example, one of the
transformations in [FG89] only partially preserves behavior. In a possible application of such
a transformation to our railroad example, one could investigate the design of the software
under the assumption that the train starts immediately after the green light is turned on.
Understanding of the environment is often necessary to determine the usability of
transformations that do not strictly maintain behavior.

We have considered in this example a single-step refinement of a specification that maintains
logical and temporal properties. The allocation of the refined specification to a set
of locations can be considered as one step in the evolutionary process of design. Normally,
a number of repeated refinements are necessary to reach a final design. To consider complex
data flows, an extended version of HMS machines [GI90] can be employed that uses
non-boolean states and replaces TIL with its first-order counterpart. For many commonly
occurring systems, however, the boolean version presented in this paper is quite adequate.
In the present example, for instance, the boolean case is sufficient for identifying information
flow into a location as the set of control arrows that enter its boundaries.


4  Summary  and  Conclusions 

In this paper, we presented a brief overview of the hierarchical multi-state (HMS) machine
specification methodology and demonstrated through an example the process of refining a
specification to a design. The important advantages of our approach are: (1) hardware, software,
communication elements and human interactions can be treated in a uniform manner,
(2) formal verification can be used to assure the correctness of the original specification,
(3) preservation of logical and temporal properties during the transition to design can be
demonstrated by either a formal verification process or by limiting refinements to transformations
that preserve or partially preserve behavioral properties, and (4) executability of the
visually-expressed HMS machine formalism provides further analysis capabilities in either
simulating behavior in the forward direction or detecting causes of errors by simulation in the
backward direction. A branching backward simulation is, in fact, the basis of the model-based
theorem proving of [Ga91b, GI91].


In most systems, the OS and the [illegible] the [illegible], with each having its own well
[illegible] in the resource management [illegible] as units, rather than cooperating to the
[illegible] of jointly implementing the desired functionality. [Illegible] propose an approach
[illegible] on the concept of agents which reside in the application run time environment, and
are customized to provide just those resource management functions which are needed by the
application. The agent implementation is customized to the particular OS and configuration,
thus [illegible] OS level knowledge. With this approach, we avoid the need for a [illegible] OS
which provides a variety of generalized functionality, while still not burdening application
programmers with heavy responsibility for resource management. We [illegible] scheduling
agents as the basis of a fully distributed, object-oriented framework called R-Shell for building
real time applications which can adapt to dynamic variations in resource requirements and
availability. In addition to scheduling agents, R-Shell includes an object-oriented [illegible],
and the concept of an object oriented resource hierarchy to describe resource requirements and
resource characteristics, and supports a new technique called resource substitution.


1  Introduction 


A hard real-time system has a dual responsibility of not only producing correct results, but also meeting
application deadlines while producing these results [15, 3, 11]. Thus, there are two correctness properties
which must be satisfied: functional correctness and timing correctness. Further, real-time and fault-
tolerant systems have the unique characteristic that the correct (timing) behavior of the system is
affected by the availability of all the computational resources which are needed. If any resource, such
as the processor, memory, disk drives, files, network etc., is even temporarily unavailable, then the
program may not complete before the deadline, or may not produce correct results. Thus, timing
correctness can be generalized to the problem of resource correctness. The need to attain timing and
resource correctness while maintaining functional correctness makes the design of real-time fault-tolerant
systems a particularly difficult problem.

The keys to achieving timing and resource correctness are the concepts of predictability and guarantees.
In order to ensure that enough time and resources will be available to applications, we must be able
to determine their resource requirements in advance, and then set up resource allocation strategies so
that all of the requirements are met. To obtain predictability, we must use design techniques which make
it easier to determine resource requirements either analytically or empirically, at compile time. Once
the resource requirements are known, the operating system must be designed so that it can provide
guarantees to applications that the needed resources will indeed be available.

Much  of  the  current  research  in  real-time  systems  focuses  on  these  two  problems,  with  language 
and  applications  designers  working  on  building  systems  with  fixed  resource  requirements,  and  operating 
systems  designers  working  on  developing  scheduling  algorithms  for  allocating  resources  so  that  tasks 
with  known  resource  requirements  can  meet  deadlines.  With  this  approach,  the  burden  of  ensuring 
timing and resource correctness at run-time is placed entirely on the OS. This is a reasonable approach
if the application has a fixed set of functions to perform, operates in a stable environment, and is
primarily repetitive in nature, such as signal processing applications. However, it does not work so well
if the resource requirements of tasks may vary at run-time, or if resource availability can change during
execution due to the occurrence of faults, or due to preemption of resources by higher priority tasks.

On  the  other  hand,  some  systems  use  models  where  the  responsibility  for  meeting  deadlines  is  placed 
on the application, with the OS providing some support capabilities which the application can utilize.
This  approach  can  deal  with  resource  scarcity  situations  by  providing  appropriate  exception  handlers  in 
the  application.  The  difficulty  is  that  this  burdens  the  application  programmer  heavily.  Moreover,  this 
approach  is  feasible  only  if  there  are  a  relatively  small  number  of  situations  which  the  application  must 
deal with; otherwise providing handlers for all combinations of possible situations becomes intractable.

However,  a  large,  complex,  and  highly  dynamic  application  such  as  an  aircraft  control  system  may 
have  a  variety  of  timing  and  resource  requirements  to  meet,  and  these  may  change  constantly.  The 
resources available to the OS may also change as the operating environment changes. To meet this challenge
of handling a wide variety of complex dynamic situations involving resources, it is necessary that
the OS and the application work together and ensure that the application resource requirements
match the actual current availability of computational resources. Our design of the R-Shell run-time support
system is aimed at integrating the design of applications with OS capabilities, and at facilitating
run-time co-operation between application and OS.

The R-Shell system is based on the concept of scheduling agents, which interface between the application
and the OS, and perform resource management functions. Scheduling agents are part of the
run-time  environment  for  a  particular  application,  and  are  generated  at  compile-time  to  provide  exactly 
that  functionality  which  is  needed  by  the  application.  They  reside  on  top  of  an  object-oriented  OS. 
They  can  be  designed  to  provide  any  resource  management  functionality  appropriate  to  the  application, 
such  as  obtaining  guarantees  from  the  OS  that  a  method  invocation  in  an  object-oriented  real-time 
application  will  have  enough  resources  to  meet  its  deadline. 

In this document, we describe the motivation for the R-Shell approach, and our design of the R-Shell
system.  The  rest  of  this  section  describes  the  complexity  of  the  problem  of  building  distributed  fault- 
tolerant  real-time  systems.  Section  2  presents  a  characterization  of  alternative  approaches  which  may  be 
employed  to  address  this  problem,  and  where  several  of  the  existing  systems  fit  into  this  characterization. 
In  section  3,  we  present  our  design  of  the  R-Shell  system.  Section  4  summarizes  the  discussion. 


Problem  description:  The  complexity  of  large  real-time  systems 

A large real-time system typically consists of many tasks, each with their own individual requirements.
Moreover, in applications such as aircraft control, these tasks may have substantial inherent unpredictability.
There may be several application characteristics which complicate the process of resource
allocation in the system:

• Deadlines: The resource allocation process must ensure that deadlines of critical operations are
met. Failure to meet these deadlines may have catastrophic consequences for the system.

• Periodic tasks: The application may include some functions such as data acquisition and monitoring
which must be performed repeatedly.

• Aperiodic tasks: User queries as well as unexpected external events can both give rise to non-periodic
requests. The OS must service these requests without jeopardizing the deadlines of critical
tasks. 

•  Precedence  and  grouping  constraints:  There  may  be  relationships  among  different  tasks 
constituting  an  application:  they  may  share  data,  or  one  may  process  data  produced  by  another.  In 
addition  to  these  synchronization  and  precedence  requirements,  there  may  be  grouping  constraints 
among  several  tasks  which  cooperate  to  perform  a  function,  so  that  their  result  is  useful  only  if  all 
of  them  complete. 

• Resource usage: Some tasks may require resources with specific characteristics, or may need
varying  amounts  of  resources  in  different  executions.  For  example,  a  particular  task  may  need  a 
32-bit  processor  with  a  floating  point  accelerator  in  order  to  produce  precise  and  timely  results. 
The  processing  and  network  bandwidth  of  a  radar  system  may  depend  on  the  number  of  targets 
being  tracked  currently. 

•  Support  for  application-specific  techniques:  The  application  may  use  particular  techniques 
such  as  recovery  blocks  [9],  and  imprecise  computation  [5]  to  handle  specific  situations  involving 
scarcity  of  resources.  The  system  designers  need  to  ensure  that  the  OS  resource  management 
policies  are  compatible  with  the  various  application-specific  techniques. 

In  addition  to  this  complex  application  model,  the  computational  environment  may  itself  have  several 
characteristics  which  make  resource  management  still  more  difficult: 

•  Faults:  The  requirements  for  fault-tolerance  and  degraded  modes  of  operation  imply  that  resource 
availability  in  the  system  may  vary  in  different  situations,  and  applications  should  be  able  to  adapt 
their  behavior  to  match  the  changing  resource  availability. 

• Resource limitations: Even under normal operation, the limited availability of resources may
create  a  problem  if  application  resource  needs  vary.  For  example,  the  network  may  become  a 
bottleneck  if  the  radar  system  needs  to  track  many  different  targets. 

• Dynamics: The resource characteristics may vary as application characteristics vary. For example,
network  behavior  changes  under  different  levels  and  different  types  of  network  load,  making  it 
difficult  to  achieve  predictability. 

•  Emergencies:  When  emergency  situations  occur,  resource  allocation  must  be  modified  to  devote 
maximum  resources  to  emergency  handling.  This  requires  that  the  scheduling  of  all  resources, 
including  networks,  should  be  based  on  preemptive  priority-driven  schemes. 

We  can  divide  this  bewildering  variety  of  requirements  into  four  categories  of  problems  which  the 
resource  management  strategy  must  address: 

1. Scheduling and resource allocation to meet deadlines, and to handle precedence and grouping
constraints for periodic and aperiodic tasks.

2. Obtaining and using semantic information about applications, including variations in resource
requirements, and needs for specific resources.

3.  Handling  of  faults  and  emergencies. 

4.  Providing  support  for  application-specific  techniques. 

OS  design  for  fault-tolerant,  distributed,  real-time  systems  is  an  extremely  complicated  problem, 
since  it  must  cater  to  a  variety  of  application  needs  in  a  highly  dynamic  computational  environment. 
It is our belief that it is difficult to handle this complexity either purely in the OS or purely in the
application. This motivates our approach of cooperation between application and OS in addressing these
issues.


2  Classification  of  current  approaches 

Conventionally,  issues  relating  to  resource  management  are  handled  entirely  in  the  OS.  It  is  considered 
desirable  to  free  the  application  from  the  burden  of  worrying  about  computational  resources.  However, 
this  is  typically  not  possible  in  a  dynamic  real-time  system.  Only  the  application  may  have  knowledge  of 
variations  in  resource  requirements.  The  handling  of  faults  and  overload  situations  may  be  application- 
specific.  Application  semantic  information  may  be  needed  for  the  scheduler  to  ensure  that  deadlines 
will  be  met.  For  all  these  reasons,  the  responsibility  for  resource  management,  and  by  extension  the 
handling  of  dynamic  behavior,  is  usually  shared  between  the  application  and  the  OS.  We  characterize  the 
approaches  to  addressing  the  problem  based  on  the  degree  to  which  each  of  them  share  this  responsibility. 

There  is  a  spectrum  of  possible  approaches,  ranging  from  handling  the  problem  entirely  in  the 
OS,  to  handling  it  entirely  at  the  application  level.  The  variation  in  the  division  of  responsibility  is 
really  a  continuous  one,  so  that  we  cannot  provide  a  strict  enumeration  of  the  different  possibilities. 
Nevertheless,  we  arrive  at  a  broad  classification  of  the  solutions  into  four  approaches,  which  correspond 
to basic differences in design philosophy. Figure 1 illustrates this classification pictorially.

The  following  is  our  characterization  of  the  solution  approaches: 

•  Application-controlled:  At  one  extreme  is  the  situation  where  the  OS  does  not  provide  any 
special  support  whatsoever.  The  application  must  incorporate  all  the  techniques  needed  to  handle 
dynamic  situations.  This  is  the  default  in  current  practice. 

•  OS-controlled:  The  other  extreme  is  the  situation  where  the  OS  takes  all  the  responsibility  for 
performing  scheduling  to  ensure  that  deadlines  are  met,  without  any  input  from  the  application. 
While  this  is  convenient  from  the  viewpoint  of  the  application  designer,  it  is  inherently  limited  in 
terms of the issues which it can address. In particular, it cannot deal well with unpredictable variations
in application resource needs, and does not support application-specific techniques (which
might  produce  better  results). 

•  OS-controlled  using  application  semantics:  This  approach  puts  the  bulk  of  the  responsibility 
on  the  OS,  with  the  application  supplying  it  with  semantic  information,  such  as  dynamic  deadlines, 
resource  requirements,  criticality  and  value  function  information.  This  approach  is  more  powerful 
than  the  previous  one,  and  still  does  not  place  a  heavy  burden  on  the  application.  Like  the  previous 
approach, it requires a sophisticated OS design.

Figure 1: Spectrum Of Approaches To Handling Dynamic Behavior


•  Application-controlled  with  OS  support:  This  approach  lets  the  application  do  the  work 
of  dealing  with  dynamic  behavior,  but  the  OS  contains  features  and  mechanisms  with  which  the 
application  can  obtain  status  information  and  modify  system  behavior.  For  example,  the  OS  may 
provide  information  about  resource  availability  and  current  resource  usage  by  other  applications, 
and  allow  applications  to  customize  the  scheduling  policies  and  modify  resource  allocations.  With 
this approach, applications have considerable flexibility in terms of implementing different techniques
to deal with situations.

In [8], we discuss several current systems, including ARTS [13], CHAOS [2], Concord [5, 7], GARTEN
[10, 6], MARUTI [4] and Spring [12, 14], and where they fit into this characterization.


3  The  R-Shell  system 

We  propose  a  system  called  R-Shell  that  provides  run-time  support  for  large,  complex,  fault-tolerant 
distributed real-time applications. R-Shell consists of an object-oriented OS, object-oriented applications,
and scheduling agents which interface between the applications and the OS. The design of R-Shell
is based on the approach of co-operative resource management. In terms of the classification presented
in Figure 1, co-operative resource management does not correspond to a particular point on the
spectrum.  Instead,  it  spans  the  entire  spectrum,  and  individual  applications  designers  can  define  the 
respective  roles  of  OS  and  applications  based  on  the  needs  and  characteristics  of  the  application. 

In  this  approach,  the  OS  and  the  application  share  the  responsibility  for  resource  management,  based 
on  some  systematic  design  methodology.  It  must  be  emphasized  that  this  approach  is  not  the  same  as 
an  equal  division  of  responsibility;  there  is  an  associated  design  methodology  using  which  application 
designers  set  out  the  roles  of  the  application  and  the  OS,  and  decide  the  modes  in  which  they  interact 
and co-operate to achieve the overall goal. The power, flexibility and utility of this approach depend on
the  methodology  itself.  A  good  methodology  can  lead  to  a  well-integrated  design  where  the  application 
and  the  OS  dovetail  perfectly,  taking  responsibility  for  just  those  aspects  of  the  overall  behavior  and 
functionality  which  they  are  best-equipped  to  handle. 

In  R-Shell,  the  scheduling  agents  which  interface  between  the  application  and  OS  are  constructed  to 
fit  the  needs  of  particular  applications.  The  OS  capabilities  they  utilize  and  the  functionality  which  they 
provide  to  the  application  can  both  be  determined  by  applications  designers  based  on  the  implementation 
platform  and  application  requirements.  However,  the  scheduling  agents  are  not  a  part  of  the  application. 
They  are  part  of  the  run-time  support  system  provided  by  the  software  development  environment. 


3.1  The  R-Shell  framework 

The  R-Shell  system  is  based  on  an  open  system  concept,  that  each  application  should  be  able  to  obtain, 
and  pay  the  performance  penalty  for,  exactly  those  elements  of  functionality  which  it  requires.  The 
problem with incorporating many features in the OS is that the OS becomes very complex, and also quite
slow,  which  is  unacceptable  in  some  real-time  applications.  On  the  other  hand,  if  the  application  must 
incorporate  many  different  resource  management  techniques,  there  is  a  heavy  burden  on  the  applications 
programmer,  and  also  there  may  be  a  performance  loss  since  resource  management  is  not  tailored  to 
the  particular  system  configuration.  Run-time  environments  (RTEs),  on  the  other  hand,  are  developed 
for particular implementation platforms, and moreover, compilers ensure that RTEs contain just those
elements of functionality which a particular application needs. Therefore, we believe that RTEs are the
ideal  components  for  incorporating  sophisticated  resource  management  features  which  may  be  needed 
only  by  some  applications.  Moreover,  since  RTEs  are  generated  in  the  context  of  a  particular  application, 
they  can  utilize  application  semantics  to  implement  application-specific  features.  Hence  the  key  feature 
of R-Shell is the use of scheduling agents which are part of the run-time support environment, and
interface  between  the  application  and  the  OS. 

It  should  be  noted  that  including  resource  management  functions  in  the  RTE  is  not  in  itself  a  novel 
idea.  Conventional  RTEs  do  manage  memory  with  heaps  and  stacks.  The  Ada  RTE  does  perform 
scheduling  functions  for  tasks  within  an  Ada  application.  The  novelty  of  our  approach  is  in  the  use 
of  scheduling  agents  in  the  RTE  to  take  advantage  of  both  semantic  and  configuration  information, 
in  giving  application  programmers  control  over  the  functionality  provided  by  the  agents,  and  in  the 
extensive  capabilities  and  responsibility  which  we  propose  for  scheduling  agents. 

Figure 2 shows the structure of the R-Shell system. The RTE for each application includes a scheduling
agent which performs resource management functions for that application. Each type of physical
resource  has  a  resource  manager  which  schedules  use  of  resources  of  that  type.  The  scheduling  agent 
interacts  with  the  resource  managers  for  each  individual  resource  to  obtain  all  the  resources  needed  by 
an  application.  Resource  managers  interact  with  each  other  to  coordinate  the  allocation  of  resources  to 
different  applications. 
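
The structure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the R-Shell implementation: the names `ResourceManager`, `SchedulingAgent`, `reserve`, `release` and `acquire_all` are hypothetical, resources are modeled as simple unit counts, and coordination between managers is reduced to an all-or-nothing acquisition with rollback.

```python
class ResourceManager:
    """Schedules use of all resources of one type (hypothetical interface)."""

    def __init__(self, resource_type, capacity):
        self.resource_type = resource_type
        self.free = capacity  # units currently unallocated

    def reserve(self, amount):
        """Reserve `amount` units; return True on success."""
        if amount <= self.free:
            self.free -= amount
            return True
        return False

    def release(self, amount):
        self.free += amount


class SchedulingAgent:
    """Per-application agent that talks to one manager per resource type."""

    def __init__(self, managers):
        self.managers = {m.resource_type: m for m in managers}

    def acquire_all(self, needs):
        """Obtain every resource in `needs`, or none of them."""
        granted = []
        for rtype, amount in needs.items():
            manager = self.managers.get(rtype)
            if manager is None or not manager.reserve(amount):
                # Roll back partial reservations so other agents can use them.
                for done_type, done_amount in granted:
                    self.managers[done_type].release(done_amount)
                return False
            granted.append((rtype, amount))
        return True


cpu = ResourceManager("processor", capacity=4)
net = ResourceManager("network", capacity=10)
agent = SchedulingAgent([cpu, net])

ok = agent.acquire_all({"processor": 2, "network": 6})
too_much = agent.acquire_all({"processor": 1, "network": 8})
```

In this sketch the second request fails on the network, and the processor unit it had already reserved is returned, so one application's failed request does not leave resources stranded.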

Scheduling  agents  can  perform  a  variety  of  different  resource  management  functions,  depending  on 
the  needs  of  the  application,  and  the  facilities  provided  by  the  underlying  OS.  Throughout  the  rest  of 
this  document,  we  illustrate  the  use  of  scheduling  agents  by  describing  some  typical  functions  needed 
in a real-time system. In particular, real-time systems need to express resource needs, obtain guarantees
about resource allocation, and handle exception situations where resources suddenly become unavailable
due to faults or preemptions. We will discuss the concepts of R-Shell by describing an OS which
provides guarantee and exception notification features, and an application which needs to express its
resource requirements, to obtain guaranteed resource allocation, and to handle resource exception situations.
However, it should be noted that scheduling agents can be designed to implement any resource
management functionality which is appropriate to the needs of a particular application.
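
A guarantee and exception notification interface of this kind might look like the following Python sketch. It is an assumption-laden toy, not the actual OS interface: `GuaranteeingOS`, `request_guarantee`, `resource_failed` and the callback convention are all hypothetical names chosen for illustration.

```python
class GuaranteeingOS:
    """Toy OS that grants resource guarantees and reports their loss."""

    def __init__(self, available):
        self.available = dict(available)  # resource name -> free units
        self.guarantees = []              # (needs, on_exception) pairs

    def request_guarantee(self, needs, on_exception):
        """Grant the guarantee only if every need can be covered now."""
        if any(self.available.get(r, 0) < n for r, n in needs.items()):
            return False
        for r, n in needs.items():
            self.available[r] -= n
        self.guarantees.append((needs, on_exception))
        return True

    def resource_failed(self, resource):
        """Notify every holder whose guarantee depended on `resource`."""
        for needs, on_exception in self.guarantees:
            if resource in needs:
                on_exception(resource)


events = []
os_ = GuaranteeingOS({"processor": 1, "memory_kb": 512})

# The scheduling agent expresses its needs and registers an exception handler.
granted = os_.request_guarantee(
    {"processor": 1, "memory_kb": 256},
    on_exception=lambda r: events.append(f"lost {r}"),
)
# A second request for the only processor cannot be guaranteed.
refused = os_.request_guarantee({"processor": 1}, on_exception=events.append)

# A fault occurs; the holder of the guarantee is notified.
os_.resource_failed("processor")
```

The handler registered by the agent is the place where an application-specific response, such as the resource substitution technique discussed later, would be invoked.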

The  R-Shell  approach  represents  an  integration  of  the  functionality  of  real-time  applications  and  of 
the OS, with respect to resource management. This integration is accomplished by the use of scheduling
agents.  Conventionally,  the  OS  is  a  configuration-dependent  entity,  which  coordinates  the  allocation 
of  resources,  and  handles  situations  such  as  faults  using  general  techniques  which  are  independent  of 
application  semantics  but  may  be  dependent  on  resource  characteristics,  such  as  process  migration, 
message  rerouting,  and  replication  of  remote  procedure  calls.  The  application  handles  resource-related 
situations  using  techniques  which  may  exploit  application  semantics,  but  are  often  independent  of  the 
system  configuration,  such  as  fault  recovery  procedures,  handling  memory  and  file  allocation  errors,  and 
version  selection  for  imprecise  computation.  In  R-Shell,  scheduling  agents  can  utilize  either  application 
semantics  or  configuration  information  or  both,  and  implement  any  of  these  features  as  necessary.  Thus, 
instead  of  locking  in  the  roles  of  the  OS  and  the  application,  scheduling  agents  allow  application  designers 
to  select  the  kind  of  behavior  they  want. 

The  salient  features  of  the  R-Shell  approach  are: 

•  Flexible  scheduling  strategy:  The  scheduling  policies  of  the  system  can  be  modified  easily  by 
changing  the  scheduling  agent  functionality.  For  example,  different  programming  languages  can 
provide  different  scheduling  agents  to  reflect  their  design  philosophy.  It  is  also  easier  to  utilize 
application semantics to make more intelligent scheduling decisions.

• Use of object-oriented model: The R-Shell approach is truly object-oriented, in that each
application object is autonomous in its scheduling decisions. Most “object-oriented” OS designs
include a centralized scheduler, which makes all scheduling decisions for all objects. The behavior
and correctness of every application program depend on the scheduling policies implemented in
this  central  scheduler,  and  on  the  resource  requests  made  by  other  applications  to  the  central 
scheduler.  This  is  in  violation  of  the  object-oriented  philosophy,  according  to  which  each  object 
should  be  an  independent  self-contained  entity,  which  can  be  designed,  implemented  and  verified 
independently.  In  R-Shell,  since  each  application  incorporates  its  own  scheduling  agent,  which 
handles all situations relating to resource management, the application is more insulated from
external scheduling decisions, and from other applications. This is a more object-oriented model
for system design, and as we discuss later, provides portability and reusability even for real-time
application  software.  Also,  the  object-oriented  nature  of  the  OS  and  the  applications  is  exploited 
in  the  design  of  the  interfaces  between  the  different  system  components,  and  in  the  modeling  of 
resources  in  R-Shell. 

•  Fully  distributed  scheduling:  Since  the  scheduler  in  many  object-oriented  operating  systems 
is  centralized,  it  constitutes  a  performance  bottleneck,  as  well  as  a  single  point  of  failure.  The 
dependence  of  real-time  behavior  on  system  and  network  load  also  makes  it  difficult  to  realize 
many  of  the  advantages  of  distributed  systems,  such  as  process  migration  and  reconfiguration  for 
fault-tolerance.  In  R-Shell,  the  scheduling  is  fully  distributed,  since  each  resource  type  has  its 
own  resource  manager,  and  every  application  its  own  scheduling  agent.  There  is  no  single  point 
of  failure.  Also,  since  the  scheduling  agent  enables  applications  to  adapt  to  different  situations  of 
resource  availability,  it  is  possible  to  use  techniques  such  as  process  migration  and  reconfiguration, 
and  perhaps  even  to  port  the  application  to  a  different  platform,  and  still  obtain  correct  real-time 
behavior. 

In  the  rest  of  this  section,  we  describe  some  of  the  features  of  each  component  of  the  R-Shell  system, 
and  how  they  are  used  to  build  systems.  However,  before  we  launch  into  a  description  of  the  components, 
we  first  present  our  object-oriented  hierarchical  model  of  resources,  which  provides  the  conceptual  base 
on  which  resource  management  in  R-Shell  is  built.  This  model  is  used  not  only  to  represent  the  actual 
physical  resources  available,  but  also  to  express  resource  requirements  and  resource  characteristics. 


3.2  Resource  modeling 

One of the innovative features of R-Shell is that it models resources using an object-oriented resource
hierarchy. This hierarchy concept allows applications to express their resource requirements more precisely.
Moreover, it facilitates the innovative technique of resource substitution (described below) which
replaces  an  unavailable  resource  with  some  other  resource  whose  properties  most  closely  match  the 
desired  properties.  This  technique  is  used  in  R-Shell  for  handling  resource  scarcity  due  to  faults  or 
preemptions. 

For  each  type  of  computational  resource,  R-Shell  describes  its  characteristics  with  an  object-oriented 
class  description.  The  methods  of  this  class  description  correspond  to  the  functionality  provided  by  the 
resource,  and  the  state  variables  capture  the  properties.  Thus,  a  processor  resource  may  have  a  class 
description  which  includes  methods  such  as  execute-instruction,  service-interrupt  etc.  The  state 
variables may include speed and number-of-interrupt-levels. Resource objects are instantiated from
this class description to represent the actual physical resources in the system.

Figure 3: A (Partial) Object-Oriented Resource Hierarchy


Resource descriptions in R-Shell are organized into a resource hierarchy. Figure 3 shows an example
of  a  partial  object-oriented  class  hierarchy.  The  various  resource  class  descriptions  in  the  system  form 
an  object-oriented  hierarchy.  As  we  go  down  the  hierarchy,  the  resource  subclasses  model  the  resource 
to  a  greater  level  of  detail,  and  describe  some  specific  subset  of  the  resources.  Thus,  Network  may  have 
a  subclass  Token-ring,  with  additional  methods  such  as  request-token  and  additional  properties  such 
as  token-round-trip-time.  The  purpose  of  the  hierarchy  is  to  capture  the  properties  of  computational 
resources  to  different  levels  of  detail,  and  to  distinguish  between  generic  and  specific  resource  types. 
With  this,  applications  can  express  their  needs  for  specific  types  of  resources,  by  requesting  resources 
belonging  to  a  particular  resource  subclass. 
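
The class descriptions named above translate directly into an object-oriented language. The following Python sketch is illustrative only: the method and property names follow the text (with hyphens replaced by underscores), while the root class `Resource`, the `send` method, the `bandwidth_kbps` property and all units and values are assumptions added to make the example self-contained.

```python
class Resource:
    """Root of the (hypothetical) resource hierarchy."""


class Processor(Resource):
    def __init__(self, speed_mips, number_of_interrupt_levels):
        # State variables capture the properties of the resource.
        self.speed_mips = speed_mips
        self.number_of_interrupt_levels = number_of_interrupt_levels

    # Methods correspond to the functionality the resource provides.
    def execute_instruction(self, instruction):
        return f"executed {instruction}"

    def service_interrupt(self, level):
        if not 0 <= level < self.number_of_interrupt_levels:
            raise ValueError("no such interrupt level")
        return f"serviced interrupt {level}"


class Network(Resource):
    def __init__(self, bandwidth_kbps):
        self.bandwidth_kbps = bandwidth_kbps

    def send(self, message):
        return len(message)  # number of bytes handed to the network


class TokenRing(Network):
    """Subclass that models one kind of network in more detail."""

    def __init__(self, bandwidth_kbps, token_round_trip_time_ms):
        super().__init__(bandwidth_kbps)
        self.token_round_trip_time_ms = token_round_trip_time_ms

    def request_token(self):
        return "token requested"


# Resource objects are instantiated to represent physical resources.
ring = TokenRing(bandwidth_kbps=4000, token_round_trip_time_ms=2.5)
cpu = Processor(speed_mips=20, number_of_interrupt_levels=8)
```

An application that only needs some `Network` can be given `ring`, while one that depends on `token_round_trip_time_ms` must request the `TokenRing` subclass specifically.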

The  resource  hierarchy  is  known  to  all  components  of  R-Shell.  The  OS  uses  the  resource  hierarchy  to 
represent  physical  resources,  and  keep  track  of  resource  allocation.  Based  on  the  resource  hierarchy,  the 
OS  also  knows  which  particular  physical  resource  can  satisfy  the  request  for  a  generic  resource  class  such 
as Primary_memory. Applications use the resource hierarchy to express their resource requirements
and  the  assumptions  they  make  about  resource  characteristics.  Scheduling  agents  use  the  resource 
hierarchy  to  direct  resource  requests  to  the  appropriate  resource  managers.  The  resource  hierarchy  also 
facilitates  the  innovative  technique  of  resource  substitution,  described  next. 
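
Satisfying a request for a generic resource class amounts to a subclass test against the hierarchy. The sketch below assumes a tiny hierarchy of our own invention (`StaticRAM`, `DynamicRAM` and `Disk` are hypothetical and may not appear in Figure 3) and a hypothetical helper `satisfy` that stands in for the OS bookkeeping.

```python
class Resource:
    pass


class PrimaryMemory(Resource):
    pass


class StaticRAM(PrimaryMemory):
    pass


class DynamicRAM(PrimaryMemory):
    pass


class Disk(Resource):
    pass


def satisfy(requested_class, physical, allocated):
    """Return the name of a free physical resource of the requested class.

    Any instance of a subclass satisfies a request for its superclass.
    Returns None when no free resource of that class exists.
    """
    for name, resource in physical.items():
        if name not in allocated and isinstance(resource, requested_class):
            allocated.add(name)
            return name
    return None


physical = {"disk0": Disk(), "sram0": StaticRAM(), "dram0": DynamicRAM()}
allocated = set()

first = satisfy(PrimaryMemory, physical, allocated)   # generic request
second = satisfy(PrimaryMemory, physical, allocated)  # generic request
third = satisfy(PrimaryMemory, physical, allocated)   # nothing left
specific = satisfy(Disk, physical, allocated)         # specific request
```

Here the two generic requests are met by different physical memories, and the third fails because every resource in that subtree is already allocated.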


Resource  substitution 

In  R-Shell,  applications  modules  (methods)  express  their  requirements  of  all  computational  resources 
(actually, most of the resource requirements are derived using a schedulability analyzer, and incorporated
into  the  scheduling  agent,  as  described  later).  Moreover,  using  the  resource  hierarchy,  they  express  the 
set  of  desired  properties  and  characteristics,  and  the  assumptions  they  make  about  the  resource.  Thus, 
one application may simply need a processor which can service interrupts, whereas another may depend
on  the  interrupt  being  serviced  within,  say,  1  microsecond.  Another  application  may  only  work  correctly 
if  the  processor  includes  a  floating  point  accelerator.  Since  real-time  systems  typically  depend  on  specific 
resource  characteristics,  this  mechanism  provides  critical  semantic  information  to  the  scheduling  agent. 

Using  this  semantic  information,  the  agent  can  perform  more  intelligent  resource  allocation.  If 
a  processor  fails,  and  the  application  must  be  relocated  to  a  different  processor,  the  agent  has  the 
information  to  decide  which  of  the  other  processors  in  the  system  will  be  acceptable,  particularly  since  the 
application  expresses  all  its  resource  needs,  such  as  the  computational  servers  it  requires,  the  assumptions 
it  makes  about  communications  delay  to  other  nodes,  the  communication  bandwidth  needed  etc.  The 
agent  can  therefore  deal  with  resource  unavailability  by  substituting  the  scarce  or  unavailable  resource 
with  some  other  resource  which  is  available. 

One form of resource substitution occurs when a resource is replaced with another similar or equivalent
resource. If an exact equivalent resource (which has all the properties desired by the application) is
not available, the scheduling agent may use the information contained in the resource hierarchy to identify
another resource which matches most of the properties and may still be acceptable. For example, it
may  provide  a  processor  with  fewer  registers  or  without  a  floating  point  accelerator.  Since  all  the  desired 
properties  are  not  met,  the  application  may  produce  an  approximate,  imprecise  result.  Thus  this  form 
of  resource  substitution  is  similar  to  imprecise  computation;  it  has  the  same  effect  of  trading  off  result 
quality  to  deal  with  resource  unavailability. 
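
The matching step described above can be illustrated with a minimal sketch. It assumes resources are described by flat sets of property names; the `Processor` class, the `pick_substitute` function and the property names are hypothetical and are not part of R-Shell, whose resource hierarchy is richer than a flat set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Processor:
    """Hypothetical resource object: a name plus the properties it offers."""
    name: str
    properties: frozenset


def pick_substitute(desired, candidates, required=frozenset()):
    """Return (processor, missing_properties) for the closest acceptable match.

    `desired` holds every property the application would like; `required`
    is the subset it cannot do without (an assumption of this sketch).
    Candidates lacking a required property are rejected; among the rest,
    the one missing the fewest desired properties wins. Returns None if
    no candidate is acceptable.
    """
    acceptable = [p for p in candidates if required <= p.properties]
    if not acceptable:
        return None
    best = min(acceptable, key=lambda p: len(desired - p.properties))
    return best, desired - best.properties


desired = frozenset({"interrupts", "fpu", "32_registers"})
required = frozenset({"interrupts"})
available = [
    Processor("cpu_a", frozenset({"interrupts", "32_registers"})),
    Processor("cpu_b", frozenset({"interrupts"})),
    Processor("cpu_c", frozenset({"fpu", "32_registers"})),
]
choice, missing = pick_substitute(desired, available, required)
print(choice.name, sorted(missing))
```

The returned set of missing properties is what an application would inspect to decide whether an imprecise result is tolerable.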

Another  form  of  resource  substitution  occurs  when  the  needs  of  one  type  of  resource  are  reduced  by 
providing  more  of  another  type  of  resource.  Computations  can  often  make  tradeoffs  between  time  and 
memory,  between  time  and  communication  bandwidth,  etc.  Some  of  these  tradeoffs  can  be  built  into 
the  scheduling  agent,  allowing  it  to  choose  from  alternative  versions  of  library  routines,  such  as  different 
sorting  algorithms  with  different  resource  usage.  For  example,  the  agent  can  increase  communication 
bandwidth  allocation  to  reduce  communication  delay,  or  insert  message  compression  and  decompression 
algorithms  to  reduce  communications  bandwidth  usage  at  the  expense  of  longer  message  delays. 
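
As a rough illustration of the compression tradeoff, the sketch below spends processor time to reduce the number of bytes sent. It uses Python's standard `zlib` module; the one-byte framing flag and the `encode`/`decode` names are assumptions made for this example, not an R-Shell interface.

```python
import zlib


def encode(message: bytes, bandwidth_scarce: bool) -> bytes:
    """Prefix a one-byte flag; compress the payload only when bandwidth is scarce."""
    if bandwidth_scarce:
        return b"\x01" + zlib.compress(message)
    return b"\x00" + message


def decode(frame: bytes) -> bytes:
    """Undo encode(), decompressing only if the flag byte says so."""
    flag, payload = frame[:1], frame[1:]
    return zlib.decompress(payload) if flag == b"\x01" else payload


# Hypothetical, highly repetitive message so that compression pays off.
message = b"track 0042 bearing 270 range 1500;" * 50
plain = encode(message, bandwidth_scarce=False)
packed = encode(message, bandwidth_scarce=True)
print(len(plain), len(packed))
```

The compressed frame is shorter but costs extra computation at both ends, which is the longer message delay the text mentions.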

R-Shell  scheduling  agents  can  support  both  forms  of  resource  substitution.  Resource  substitution 
can  be  used  for  handling  faults  and  preemptions,  both  at  the  application  and  the  OS  level. 

3.3  Resource  managers 

The  OS  in  R-Shell  consists  of  a  collection  of  resource  manager  objects.  Each  type  of  physical  resource 
in  the  system  has  an  OS  resource  manager  associated  with  it,  which  schedules  resources  of  that  type. 
Resource  managers  maintain  a  list  of  resource  objects,  corresponding  to  actual  resources  available. 
They  receive  requests  from  applications  for  resources,  and  allocate  resource  objects  to  them.  When 
an applications module makes requests for several different types of resources, the scheduling agent for
the  application  sends  requests  to  the  resource  managers  for  each  type  of  resource.  Resource  managers 
interact  with  each  other  to  coordinate  scheduling  of  different  applications,  to  avoid  deadlock.  When 
a  resource  manager  is  able  to  satisfy  a  resource  request,  it  provides  the  application  with  a  guarantee 
that  the  needed  resources  will  be  provided.  R-Shell  supports  multiple  levels  of  guarantees,  including 
an  absolute  guarantee  (subject  to  faults),  a  guarantee  subject  to  preemptions  by  later  higher-priority 
jobs,  and  a  best-effort  guarantee  (not  quite  the  same  as  no  guarantee,  because  of  exception  notification, 
described  below).  Applications  can  select  the  level  of  guarantee  they  require;  they  can  also  request  a 
lower-level  guarantee  if  a  higher-level  guarantee  is  refused. 
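
The fallback from a higher-level to a lower-level guarantee might look like the sketch below. The three level names follow the text, but the admission policy (firm units, a fixed overbooking allowance for preemptible requests) and all class and function names are assumptions of this example.

```python
from enum import IntEnum


class Guarantee(IntEnum):
    """Guarantee levels named in the text, weakest to strongest."""
    REFUSED = 0
    BEST_EFFORT = 1
    PREEMPTIBLE = 2   # holds unless a later higher-priority job preempts
    ABSOLUTE = 3      # holds subject only to faults


class ResourceManager:
    """Toy manager for one resource type (hypothetical admission policy)."""

    def __init__(self, total_units, overbook_units):
        self.total = total_units
        self.overbook = overbook_units
        self.firm = 0   # units held under ABSOLUTE guarantees
        self.soft = 0   # units held under PREEMPTIBLE guarantees

    def request(self, units, level):
        """Grant `level` for `units`, or return Guarantee.REFUSED."""
        held = self.firm + self.soft
        if level == Guarantee.ABSOLUTE:
            if held + units <= self.total:
                self.firm += units
                return level
        elif level == Guarantee.PREEMPTIBLE:
            if held + units <= self.total + self.overbook:
                self.soft += units
                return level
        elif level == Guarantee.BEST_EFFORT:
            return level   # nothing is reserved
        return Guarantee.REFUSED


def request_with_fallback(manager, units, preferred):
    """Try `preferred`, then each weaker level in turn."""
    levels = sorted(
        (g for g in Guarantee if Guarantee.REFUSED < g <= preferred),
        reverse=True,
    )
    for level in levels:
        granted = manager.request(units, level)
        if granted != Guarantee.REFUSED:
            return granted
    return Guarantee.REFUSED


mgr = ResourceManager(total_units=4, overbook_units=2)
first = request_with_fallback(mgr, 3, Guarantee.ABSOLUTE)
second = request_with_fallback(mgr, 2, Guarantee.ABSOLUTE)
third = request_with_fallback(mgr, 2, Guarantee.ABSOLUTE)
print(first.name, second.name, third.name)
```

Each successive request in the usage lines receives a weaker guarantee as the toy manager fills up.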

Upon request, resource managers can provide information about resource availability, and also accept
messages from applications specifying information about resource usage, such as preferences for certain
resources. If a resource manager cannot maintain a guarantee, due to faults, or to preemption of
resources by higher priority tasks (in the case of the best-effort guarantee), it sends a message to the
application (i.e. an upcall) notifying it of the resource exception. These exception notification messages
enable applications to perform exception handling.

3.4  Support  for  building  applications 

R-Shell is designed to facilitate building object-oriented real-time applications. The support provided
includes schedulability analysis tools to determine resource requirements, scheduling agents to perform
resource management, and language primitives for expressing resource needs, for requesting guarantees,
for performing exception handling, and for defining multiple versions of methods if imprecise computation
techniques are used. Because of its widespread popularity, and the easy availability of a public domain
compiler (GNU C++) which can be modified, we have chosen to add the primitives onto C++, though
they can as easily be added to any other object-oriented language. The support provided by R-Shell for
building real-time applications includes:

•  Expression  of  resource  requirements:  A  major  emphasis  in  R-Shell  is  on  complete  expression 
of  resource  requirements.  Current  programming  languages  use  ad  hoc  primitives  to  express  resource 
needs,  such  as  malloc  for  memory,  opening  files  to  read  data,  opening  sockets  to  get  access  to  the 
network  etc.  Also,  in  current  systems,  programs  do  not  describe  their  assumptions  about  resources, 
though  they  do  make  assumptions  about  network  speed,  network  bandwidth,  processor  speed  etc. 
This  is  a  major  reason  why  programs,  particularly  real-time  and  fault-tolerant  programs,  are  not 
portable  to  configurations  other  than  the  one  they  were  written  for. 

Using  the  object-oriented  paradigm,  R-Shell  allows  the  expression  of  all  resource  needs  as  requests 
for  particular  kinds  of  resource  objects.  The  R-Shell  programming  language  supports  the  concepts 
of  resource  objects  described  earlier,  and  the  language  definition  includes  descriptions  of  a  standard 
set  of  resource  objects,  with  parameters  which  represent  their  characteristics.  Applications  send 
requests  to  the  OS,  asking  for  allocation  of  instances  of  particular  kinds  of  resource  objects.  Then 
they  make  calls  (send  messages)  to  the  resource  objects  to  perform  operations  on  them,  such  as 
sending  a  message  or  reading  some  input  from  a  file. 

•  Deriving resource requirements from schedulability analysis: The language incorporates
a schedulability analyzer which uses a combination of analytical techniques and test runs to derive
the resource requirements of programs, and automatically incorporate requests for resources when
they are not explicitly coded in by the user. This feature relieves the application programmer of
the burden of coding in explicit requests for all resources, and particularly of having to estimate
execution times. The design of the analyzer is based on our previous design of the GARTAAN
schedulability analyzer [10]. If the method requires any special resources (such as sensors and effectors)
which cannot be inferred by the analyzer, or if it makes specific assumptions about resource
characteristics, then the applications programmer must insert requests for the corresponding resource
class. The information derived from the schedulability analyzer is incorporated into the scheduling
agent, as described in section 3.6.

•  Requests for guarantees: The programming language also supports the notion of guarantees
for method invocations. It provides the primitive guarantee which can be applied to a method,
which requests the OS (and the scheduling agent) to guarantee that the resources needed to
execute the method will be available. The guarantee primitive returns a value which indicates the level of
guarantee provided to the method. If the guarantee is refused, the application can determine this
from the return value, and take appropriate exception handling action.

•  Exception handling: Application objects can also include methods which process exception notification
messages from the operating system. If a particular type of resource becomes unavailable,
the OS sends the application a message notifying it of the exception. Applications can handle
these exceptions by defining appropriate exception handler methods. These exception handlers
facilitate the incorporation of various fault-tolerance schemes in application objects, so that each
object can encapsulate its own resource exception handling. This is in contrast to current systems,
where fault detection must be done explicitly by the application by setting up and catching signals,
or other similar mechanisms, and the language may provide little or no support for performing
fault-tolerance.

•  Version  selection  for  imprecise  computation:  If  the  application  chooses  to  use  the  techniques 
of  imprecise  computation  for  handling  resource  scarcity  and  faults,  then  the  programming  language 
provides  support  for  defining  multiple  versions,  and  for  performing  version  selection  at  run-time. 
Application objects may define several different implementations of any method, using the techniques
of imprecise computation. These different versions differ in their resource requirements,
and  produce  different  qualities  of  result.  At  run-time,  that  version  is  selected  for  execution  whose 
result  quality  is  the  best,  from  among  those  whose  resource  requirements  can  be  guaranteed.  This 
version  selection  can  be  performed  by  the  scheduling  agent  for  the  application. 
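
The version selection rule in the last item above (choose the best result quality among versions whose resource requirements can be guaranteed) can be sketched as follows. The `Version` record, the numeric quality score and the `can_guarantee` callback are assumptions of this example; the callback stands in for queries to the resource managers.

```python
from dataclasses import dataclass


@dataclass
class Version:
    """Hypothetical description of one implementation of a method."""
    name: str
    quality: float   # higher is better
    needs: dict      # resource type -> units required


def select_version(versions, can_guarantee):
    """Return the highest-quality version whose needs can all be guaranteed.

    `can_guarantee(resource_type, units)` is a stand-in for asking the
    resource managers. Returns None when no version is feasible.
    """
    feasible = [
        v for v in versions
        if all(can_guarantee(r, u) for r, u in v.needs.items())
    ]
    return max(feasible, key=lambda v: v.quality, default=None)


free = {"cpu_ms": 5, "memory_kb": 64}
versions = [
    Version("precise", 1.0, {"cpu_ms": 8, "memory_kb": 32}),
    Version("approximate", 0.7, {"cpu_ms": 4, "memory_kb": 16}),
    Version("coarse", 0.3, {"cpu_ms": 1, "memory_kb": 4}),
]
chosen = select_version(versions, lambda r, u: free.get(r, 0) >= u)
print(chosen.name)
```

With only 5 ms of processor time free, the precise version is infeasible and the approximate one is selected.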

It  should  be  noted  that  the  concept  of  scheduling  agents  is  language-independent,  and  that  scheduling 
agents  can  be  provided  even  for  languages  which  do  not  provide  the  special  support  facilities  described 
here. These particular facilities are provided in R-Shell because we believe they are very useful for
building real-time applications; however, other languages and OS designs may prefer to provide different
functionality in scheduling agents, and different language and run-time support facilities.


3.5  Run-time  support:  Scheduling  agents 

Every  application  has  a  scheduling  agent  associated  with  it  at  run-time.  This  scheduling  agent  is  part  of 
the run-time support environment for the application, and is an enhancement of the conventional concept
of the RTE. Whereas conventional RTEs provide functions such as object management and resolution
of  method  invocations,  the  scheduling  agent  can  generalize  this  to  include  scheduling  functions  such  as 
resource  management  and  version  selection. 

The  application  dictates  the  functionality  to  be  provided  by  the  scheduling  agent.  For  example,  if 
the  application  uses  the  guarantee  primitive,  then  the  scheduling  agent  will  perform  resource  guarantee 
functions.  Similarly,  if  the  application  defines  several  different  versions  of  methods,  then  the  scheduling 
agent  will  include  version  selection  functionality.  Applications  designers  can  also  specify  that  certain 
functionality  should  or  should  not  be  included  in  the  scheduling  agent,  e.g.  whether  or  not  the  scheduling 
agent  should  perform  resource  substitution. 

The  functionality  provided  by  scheduling  agents  will  also  depend  on  the  programming  language.  For 
example,  a  scheduling  agent  for  Ada  would  include  support  for  those  PRAGMAS  which  are  implemented 
on the particular system. The implementation of the scheduling agent would depend on the capabilities
provided by the underlying OS. Thus, resource substitution functionality can only be provided if the OS
supports the resource model.

Figure 4 illustrates the operation of a typical scheduling agent for a real-time application which requires
guarantees, and uses imprecise computation. The application expresses resource requirements for
each  version  of  each  method.  When  the  application  requests  that  a  method  invocation  be  guaranteed, 
the  scheduling  agent  is  activated.  The  agent  interacts  with  the  OS  resource  managers  to  determine 
whether  all  the  resources  needed  by  the  application  are  available.  Based  on  this  information,  it  selects  a 
version  of  the  method  to  execute,  and  requests  each  resource  manager  involved  to  guarantee  its  resource 
requirements. By determining resource availability in advance, the agent performs the coordination function
of avoiding situations where some resources are guaranteed, but others are refused. If a guarantee
is  provided,  but  is  subsequently  violated  (due  to  faults  or  preemptions),  then  the  agent  itself  attempts 
to  handle  the  exception  situation  by  using  resource  substitution  or  selecting  an  alternative  version  to 
execute. In the event that it is unsuccessful, or if guarantees were refused in the first place and no alternative
version is available, the agent notifies the application and transfers control to the appropriate
exception handler. If the application does not provide exception handlers, scheduling agents themselves
provide default exception handlers (perhaps simply printing an error message and quitting).
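
The coordination step in this paragraph, checking availability with every manager before requesting any guarantee, can be sketched as below. The `Manager` class, its three methods and the rollback on a late refusal are assumptions of this example rather than the R-Shell interfaces.

```python
class Manager:
    """Toy resource manager tracking free units of one resource type."""

    def __init__(self, free):
        self.free = free

    def available(self, units):
        return units <= self.free

    def guarantee(self, units):
        if not self.available(units):
            return False
        self.free -= units
        return True

    def release(self, units):
        self.free += units


def guarantee_all(managers, needs):
    """Guarantee every resource in `needs`, or none of them."""
    # Query first, so that no manager commits when another would refuse.
    if not all(managers[r].available(u) for r, u in needs.items()):
        return False
    granted = []
    for r, u in needs.items():
        if managers[r].guarantee(u):
            granted.append((r, u))
        else:
            # Availability changed between query and request: roll back.
            for gr, gu in granted:
                managers[gr].release(gu)
            return False
    return True


def run_guaranteed(managers, versions, on_exception):
    """`versions` is a list of (name, needs) pairs ordered best first."""
    for name, needs in versions:
        if guarantee_all(managers, needs):
            return name
    return on_exception()


managers = {"cpu": Manager(5), "net": Manager(1)}
versions = [
    ("precise", {"cpu": 4, "net": 2}),
    ("approximate", {"cpu": 3, "net": 1}),
]
first = run_guaranteed(managers, versions, lambda: "default handler")
second = run_guaranteed(managers, versions, lambda: "default handler")
print(first, second)
```

The first call falls back to the approximate version without reserving anything for the precise one; the second call finds no feasible version and reaches the exception path.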

Thus  scheduling  agents  can  provide  resource  management  functionality  for  the  application,  without 
adding  unnecessary  overhead  to  those  applications  which  do  not  require  the  functionality.  It  may  be 
argued  that  scheduling  agents  unnecessarily  duplicate  in  a  higher  layer  some  functions  which  can  be 
provided  in  the  OS  itself,  and  therefore  add  processing  and  memory  overhead.  However,  this  possible 
disadvantage  is  more  than  offset  by  the  elimination  of  the  unused  functionality  which  is  inevitable  when 
using  a  sophisticated  OS,  and  by  the  use  of  application  semantic  information.  For  example,  if  the  OS 
scheduler  supported  imprecise  computation,  and  were  to  perform  version  selection,  it  may  spend  time 
in  sorting  the  various  versions  in  order  of  decreasing  resource  requirements  and  increasing  result  quality, 
so  that  it  may  first  try  to  guarantee  the  better  versions,  and  if  they  fail,  then  try  poorer  versions. 
On  the  other  hand,  the  scheduling  agent  can  have  the  version  order  hard-coded  into  itself,  and  thus 
avoid  the  sorting  step  altogether.  We  can  also  avoid  replication  of  the  same  scheduling  agent  code  in 
different  applications  by  making  scheduling  agent  code  re-entrant,  and  sharing  the  code  among  different 
applications. 

3.6  Construction  of  scheduling  agents 

Scheduling  agents  are  generated  during  compilation  and  in  the  post-compilation  phase.  The  following 
information  is  used  in  the  generation  of  scheduling  agents: 

•  Resource  requirements  information  expressed  in  the  application. 

•  Resource  requirements  derived  from  schedulability  analysis. 

•  Application semantic information expressed using the language constructs, such as: multiple versions
of methods, guarantee primitives, PRAGMAS.

•  System  configuration  information,  as  obtained  from  the  OS  at  agent  generation  time. 

•  Library routines which implement special resource management techniques, such as various fault-tolerance
techniques, version selection and resource substitution.

•  Exception handlers defined in the application.

•  Default resource exception handling routines.

•  Directives from the application programmer specifying the functionality to be included in the
scheduling agent.

Figure 4: Example of Scheduling Agent Operation

The scheduling agent combines the information about resource requirements obtained from the application
and from schedulability analysis, to derive the resource requirements of each application method.
Using the resource model, the resource requirements are translated into guarantee requests to individual
OS resource managers. The use of the guarantee primitive triggers the execution of this resource
request code in the scheduling agent. If the application defines multiple versions, the agent performs
version selection before requesting guarantees. Receiving exception notification from the OS triggers the
execution of special routines such as fault handling and resource substitution. The inclusion of these
special routines, and the strategies for obtaining guarantees, performing version selection etc. are all
dependent on the directives issued by the application programmer. When a fault notification is received,
the scheduling agent first attempts to execute any applicable special routines which it may contain. If
there are no applicable special routines, then the agent triggers the execution of any fault handlers defined
by the application for the situation. If there are none, then it executes the default handler.
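
The three-stage handling order just described can be sketched as a small dispatcher. This is an illustration only: treating a special routine as "applicable" when it returns True, the dictionary-based fault record and all names are assumptions of this example.

```python
def handle_fault(fault, special_routines, app_handlers, default_handler):
    """Dispatch a fault notification in the order described in the text.

    `special_routines` is a list of callables that return True when they
    applied to (and dealt with) the fault, e.g. by resource substitution.
    `app_handlers` maps a fault kind to an application-defined handler.
    """
    for routine in special_routines:
        if routine(fault):
            return "special:" + routine.__name__
    handler = app_handlers.get(fault["kind"])
    if handler is not None:
        handler(fault)
        return "application"
    default_handler(fault)
    return "default"


def substitute_processor(fault):
    """Hypothetical special routine: only processor failures apply."""
    return fault["kind"] == "processor_failed"


log = []
app_handlers = {"sensor_lost": lambda f: log.append("app saw " + f["kind"])}
default = lambda f: log.append("default saw " + f["kind"])

results = [
    handle_fault({"kind": k}, [substitute_processor], app_handlers, default)
    for k in ("processor_failed", "sensor_lost", "link_down")
]
print(results)
print(log)
```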

It must be emphasized that scheduling agents can be built for any programming language and any
operating system, to perform any functions desired by the application, as long as the run-time overhead
of providing that functionality is acceptable to the application. In R-Shell, we propose the particular
design and functions outlined here, because we believe that these are particularly appropriate for building
object-oriented, distributed, fault-tolerant real-time applications which can adapt to a variety of dynamic
resource requirements and resource availability. We believe that the performance of this approach will be
comparable to, if not better than, that of systems which provide equivalent functionality by incorporating
features in the OS.


4  Conclusion 


The primary contribution of R-Shell is its use of scheduling agents to integrate the resource management
functionality that is conventionally provided separately by the OS and the application. Since scheduling
agents have access to system configuration information as well as application semantic information, they
can implement sophisticated resource management strategies. By avoiding the inclusion of unnecessary
functionality in applications which do not require it, scheduling agents can reduce system complexity and
improve performance. Scheduling agents provide a flexible, fully distributed, object-oriented solution to
the problem of resource management in real-time systems.

In addition to scheduling agents, R-Shell also includes some other innovative features: the use of a
resource hierarchy to model resource characteristics, enabling applications to express resource requirements
better; explicit interaction between applications and OS to obtain guarantees and information
about resource availability; the use of exception notification and exception handlers for fault detection
and fault-tolerance; and the technique of resource substitution for dealing with resource scarcity.


Abstract: Event based state transitions are central to
some  techniques  for  specification  of  real  time  systems. 
As  a  part  of  the  Software  Cost  Reduction  (SCR)  project 
at  the  Naval  Research  Laboratory,  an  event  descriptor 
notation  was  defined  in  which  events  are  described  in 
terms  of  changes  to  boolean  predicates.  Follow-on 
research  produced  other  models  and  definitions  of  the 
SCR  event  descriptor.  However,  all  of  these  definitions 
have  limitations  in  terms  of  their  ability  to  describe 
certain  useful  classes  of  events.  We  therefore  establish 
a  rationale  for  evaluating  event  descriptors  in  terms  of 
generality,  implementability,  and  verifiability.  To  fulfill
these  requirements,  we  propose  an  extended  event 
descriptor  which  allows  expression  of  a  larger  class  of 
events.  The  new  descriptor  has  the  added  advantage  of 
displaying  related  functional  and  timing  specifications 
together,  which  allows  easier  understanding  of  the 
meaning  of  an  event.  We  explore  the  meaning  of  the
event  descriptor,  both  in  terms  of  a  formal  definition 
and  the  code  which  is  required  to  implement  the 
specification. 

1.  Introduction 

In  software  specification  documents  for  real  time
systems,  functions  to  be  incorporated  in  the 
software  may  be  specified  such  that  their  output 
values  change  in  response  to  external  events.  Such 
functions  are  most  easily  specified  in  terms  of  state 
transitions,  where  the  transitions  are  triggered  by 
events.  An  example  would  be  a  function  which 
illuminates  a  warning  light  whenever  a  fluid  level
exceeds  a  certain  amount  and  which  then
extinguishes  the  light  if  an  acknowledgment
button  is  depressed  by  the  operator.  This  paper 
concerns  attempts  to  produce  a  notation  to  define 
and  describe  such  events  in  software 
specifications. 


This  work  was  supported,  in  part,  by  the 
National  Science  Foundation,  under  grant 
CCR-9057874  to  the  University  of  Maryland.


Under the direction of D. Parnas, the Naval
Research Laboratory's Software Cost Reduction
(SCR) project provided an important contribution
to formal requirements specification of real-time
software by producing a formal specification
method which was validated by application to a
large software system: the already existing A-7
aircraft Operational Flight Program. That
specification, in its final form, ran to approximately
500 pages [NRL 88]. As a part of that
specification method, SCR developed a notation to
describe events, using an extension of first order
logic [HENI 80]. In this notation, an atomic event
is described by the notation "@T(condition)," where
condition is a boolean predicate. The event
described by the At-True expression is considered
to occur at any instant in time when the predicate
condition transitions from false to true. "@F"
indicates the converse. The notation was further
extended to incorporate a guarding condition, in
the form "@T(condition1) WHILE (condition2)".
This event occurs at the instant when condition1
transitions from false to true, given that
condition2 is true at the same time.
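
The intuitive reading of these descriptors can be written down as a short sketch over sampled data. It assumes both predicates are sampled in lock-step and compared at the same sample index, which is an idealization; the function names are hypothetical and this is not the SCR definition.

```python
def at_true(samples):
    """Indices at which a sampled predicate goes from False to True."""
    return [i for i in range(1, len(samples))
            if samples[i] and not samples[i - 1]]


def at_false(samples):
    """Indices at which a sampled predicate goes from True to False."""
    return [i for i in range(1, len(samples))
            if samples[i - 1] and not samples[i]]


def at_true_while(cond1, cond2):
    """@T(cond1) WHILE (cond2), reading cond2 at the same sample index."""
    return [i for i in at_true(cond1) if cond2[i]]


a = [False, False, True, True, False, True]
b = [True, True, True, False, False, False]
print(at_true(a), at_false(a), at_true_while(a, b))
```

In this trace A becomes true twice, but only the first transition occurs while B is true.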

The  description  of  the  @T  notation  in  the  previous 
paragraph  is  strictly  intuitive.  It  uses  the  words 
"instant"  and  "at  the  same  time."  However,  since 
computer  software  does  not  deal  in  infinitesimal 
instants  or  simultaneity,  no  program  can  monitor 
both  event  and  condition  at  the  same  time.  If  the 
notation  is  to  specify  the  behavior  of  computer 
software,  then  its  formal  definition  must  be
unambiguous  and  implementable  with  software, 
and  the  implementation  must  be  verifiable. 

We  propose  three  main  aspects  by  which  an  event 
descriptor  should  be  evaluated.  First,  it  should  be
general  enough  to  cover  a  class  of  useful, 
reasonable  events.  We  will  explore  certain  types  of 
events  that  require  a  more  expressive  descriptor. 
Second,  it  must  be  unambiguous  and  should  allow 
implementation  in  code  in  a  straightforward
manner.  Thus,  it  should  not  require  that  unduly
complex  code  be  written  to  detect  simple  events. 
Third,  the  descriptor  should  enable  the  verification 
of  the  code  that  is  written  to  implement  it.  There 
should  be  a  practical  means  to  examine  the  code  to 
determine  that  it  does  in  fact  fulfill  the  specification.
These  can  be  difficult  standards  to  fulfill.  As
we  will  demonstrate,  event  detection  in  code  is 
often  a  simple  matter,  perhaps  requiring  no  more 
than  a  single  condition  statement.  Yet  where 
complex  definitions  of  events  are  required,  the 
code  that  implements  those  events  may  become 
quite  intricate. 

In  section  2,  we  address  the  main  topic  of  this 
paper,  which  is  the  issue  of  generality,  or  the 
variety  of  occurrences  that  can  be  covered  by  the 
definition  of  event.  Section  3  covers  previous  work 
that  has  been  published  concerning  the  definition 
of  the  @T  event  descriptor.  In  Section  4,  we 
motivate  the  selection  of  an  improved  event 
descriptor  with  a  detailed  description  of  the 
requirements  that  the  descriptor  should  fulfill. 

Our  proposed  descriptor  is  presented  in  Section  5 
and  then  defined  and  implemented  in  code  in 
Section  6.  Section  7  covers  other  factors  involved 
with  implementation,  concerning  the  nature  of  inputs
to  the  program.  In  Section  8,  we  evaluate  our
descriptor  and  the  previous  work  in  terms  of  the 
proposed  requirements. 


2.  Generality  in  Event  Description 

2.1  Event  Sequencing  and  Timing 

Since  single  processor  systems  can  test  only  a 
single external condition at a time, and communication
and resource contention prevent distributed
systems from comparing external data values at
the same time, the definition should imply a
specific sequence of tests upon external variables.
It is this need for a sequence of testing that gives
rise to the most fundamental ambiguity in the
"@T(A) WHILE (B)" notation: should the condition
be  tested  before  the  event,  after  it,  both  before 
and  after,  or  either  before  or  after?  In  practice,  it 
is  useful  to  be  able  to  specify  any  of  the  four 
situations.  The  following  paragraphs  provide 
examples  in  which  each  of  the  four  would  be 
appropriate. 


Before. Suppose that the occurrence of the event
@T(A) may make the value of B inaccessible. For
example, if A is an explosion, and B is a reading of
ambient temperature from a delicate device
adjacent to the explosive, B may be invalid once A
becomes true. In this case, the specification writer
("specifier") should be able to dictate that, when A
transitions from false to true, the WHILE
condition should use the value of B obtained most
recently before the event @T(A). The same order
would also apply if @T(A) WHEN (B) initiates a
process, @F(B) terminates the process, and it is
essential that the process be initiated in cases
where the same external occurrence may cause
@F(B) shortly after @T(A).

After. Suppose B is valid only after @T(A) occurs.
For example, if B were the output of a peak-reading
pressure meter that measured the
intensity of the explosion in the previous example,
it would be essential to specify that B be checked
only after the event @T(A). Or, in the second example
of the previous paragraph, if there is a
safety constraint which dictates that the process
not be initiated in cases where @T(A) and @F(B)
are caused by the same external occurrence, the
specifier may wish to dictate that the value of B be
checked after @T(A).

Both.  Suppose  there  are  two  types  of  events  that 
may  cause  @T(A).  One  type  will  affect  only  A;  the 
other  type  will  make  A  true  and  will  reverse  the 
value  of  B.  If  only  the  first  type  of  event  should 
trigger  the  function,  then  the  value  of  B  must  be 
checked  both  before  and  after  the  value  of  A. 

Either. In discussing difficult and complex cases,
we must remember that, in most cases, the
sequencing and timing of event detection will not
be critical. Using the explosive-temperature
example, if the thermometer were placed far
enough from the explosion not to be affected, the
sensing of the explosion and checking of the
temperature could occur in any sequence. The
specification might only require that the
temperature value be sampled at a time reasonably
close to the explosion (i.e. not an hour prior).

These examples may also be extended to cover
cases where indeterminacy in data-arrival timing
dictates that the value of B be examined within a
certain interval prior to, or after, the occurrence of
@T(A). For example, in the measurement of the
explosive's ambient temperature, we may require
that B be measured no earlier than one minute
prior to @T(A), but no later than one millisecond
prior to the event. A good event descriptor should
be able to express any of these situations.
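
The four orderings and the interval constraint can be made concrete with a small sketch over timestamped samples of B. The function names, the single symmetric window and the rule that a missing sample counts as false are all assumptions of this example, not part of any descriptor defined in the paper.

```python
def latest_before(samples, t, window):
    """Value of the most recent sample of B in [t - window, t), or None."""
    hits = [s for s in samples if t - window <= s[0] < t]
    return max(hits, key=lambda s: s[0])[1] if hits else None


def earliest_after(samples, t, window):
    """Value of the first sample of B in (t, t + window], or None."""
    hits = [s for s in samples if t < s[0] <= t + window]
    return min(hits, key=lambda s: s[0])[1] if hits else None


def guard_holds(samples, t, mode, window):
    """Evaluate the guard B around event time t.

    `samples` is a list of (time, value) pairs; `mode` is "before",
    "after", "both" or "either". A missing sample counts as False,
    which is a choice made for this sketch.
    """
    before = latest_before(samples, t, window) is True
    after = earliest_after(samples, t, window) is True
    return {
        "before": before,
        "after": after,
        "both": before and after,
        "either": before or after,
    }[mode]


# B is true until just after the event at t = 1.0, then becomes false.
b_samples = [(0.0, True), (0.9, True), (1.1, False), (2.0, False)]
outcome = {m: guard_holds(b_samples, 1.0, m, window=0.5)
           for m in ("before", "after", "both", "either")}
print(outcome)
```

The trace models the case where the same occurrence causes @F(B) shortly after @T(A): the "before" and "either" readings accept the event, while "after" and "both" reject it.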

2.2  Input  Behavior 

Another consideration is the ability of the
specification to handle ill-behaved input signals.
Three examples of such difficult behavior that
must be addressed are shown in Figure 1. In Case
1, a digital signal is produced by a mechanical
switch which suffers from "bouncing" as it is
turned on. The signal oscillates for a time, before
settling into its new state. The event @T(Switch =
On) would then be detected four times, for a single
movement of the switch. In Case 2, an inherently
analog signal, temperature, is being measured
with an accuracy (±0.15) worse than the precision of
the digital input (±0.05). Thus, the digital signal
displays random oscillation about its true value
(indicated by the dotted line). If the event of
interest were @T(Temp ≥ 98.4), this oscillation
could cause the event to be detected multiple times
during a period when the temperature was
unchanging. @T(Temp ≥ 98.7) would be detected
twice, even though the true value was constantly
increasing. In Case 3, the temperature is being
measured at a better level of accuracy (±0.01) than
the digital precision (±0.05). The signal therefore
tends to oscillate about the true value. A
temperature of 98.42 results in an oscillating
digital value that spends 60% of the time at 98.40
and 40% at 98.45. Even with a constantly
increasing temperature, the digital input may
show brief decreases. In the figure, such an
oscillation would cause the event @T(Temp ≥ 98.6)
to be detected twice.

Thus, in any of these three cases, an @T expression
might be triggered many times due to an
oscillating or "bouncing" value. If we are to allow
for system designs that require the above events to
occur only once in each such case, we will
need to extend the descriptor to provide for more
complete handling of the inputs.
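
The effect described for Case 1 can be reproduced with a few lines of code. The following Python sketch is an illustration only: the sample sequence is hypothetical, and count_at_true is a name introduced here for a naive detector that reports @T(Switch = On) at every false-to-true transition between successive samples.

```python
def count_at_true(samples):
    """Count false-to-true transitions in a sequence of Boolean samples."""
    events = 0
    previous = samples[0]
    for current in samples[1:]:
        if current and not previous:
            events += 1
        previous = current
    return events


# One physical movement of the switch, sampled while the contact bounces
# (hypothetical data, chosen to mimic Case 1).
bouncing_switch = [False, True, False, True, False, True, False, True, True, True]
print(count_at_true(bouncing_switch))
```

For this sequence the naive detector reports four events for a single movement of the switch, which is the behavior the extended descriptor is meant to rule out.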

3.  Previous  work 

The SCR project [NRL 88] formally defined the At-True
notation as follows:

This notation calls for the evaluation of a
condition simultaneously with the detection
of an event. Since this is not possible
in any real implementation, we define the
meaning of an expression such as @T(X)
WHEN (Y) thus: Let tg be the time at
which @T(X) is detected by the software.
Then there is a time interval ε such that if
Y is true (false) for all of [tg-ε, tg], or for
all of [tg, tg+ε], or for all of [tg-ε, tg+ε],
then the expression is (is not) considered to
be satisfied. The behavior of the system
when Y is true for only part of the interval
is not defined; it may or may not behave as
though the expression were satisfied.

Which interval we use as the requirement,
as well as the value of ε, is defined in the
Accuracy chapter [of the SCR specifications]
for each such event.

This definition acknowledges the impossibility of
simultaneous  evaluation  of  conditions  and  events 
and  permits  the  specifier  to  choose  whether  the 
condition  should  be  true  before,  after,  or  before 
and  after,  the  software  detects  the  event.  The 
specifier  may  also  choose  the  interval  (epsilon) 
over  which  the  condition  must  be  true.  However, 
this  definition  has  several  weaknesses.  Most 
importantly,  it  fails  to  directly  relate  the  behavior 
of  the  event  variable  (X)  to  the  WHEN  variable 
(Y). Y's behavior is related to the time when the
event  is  detected  by  the  software,  rather  than  the 
time  when  X's  value  changes.  The  delay  that  is 
allowable  in  detecting  the  event  and  the  rate  at 
which  the  event  variable  must  be  checked  are  not 
directly  specified  at  all  in  SCR.  Only  where  the 
event  triggers  the  performance  of  a  function  is  a 
"Maximum  Delay  to  Completion"  given  for  that 
function.  The  placement  of  the  timing  constraints 
in  a  separate  chapter  makes  it  difficult  to 
understand  the  correct  meaning  of  an  event.  There 
is  no  provision  for  avoiding  spurious  events 
resulting from bouncing or oscillating inputs. The
definition does not make any attempt to deal with
situations where the WHEN condition may be
expected to change shortly before or after the event.
No  provision  is  made  to  specify  that  the  condition 
be  checked  both  before  and  after  the  event. 

[PAUL  89]  provides  the  specification  method  of 
SCR  with  both  a  theoretical  basis,  in  finite  state 
automata,  and  a  practical  means  of  translation 
into code, using State Transition Event (STE)
synchronization of cooperating sequential processes.
This mechanism is intended to supply the
temporal rigor required of Hard Real Time
embedded  systems  and  to  address  the  often 
complex  issue  of  process  scheduling  that  can  arise 
in  an  SCR  specification.  This  dissertation  makes  a 
very  important  point  on  the  ordering  of  events. 
Whether  due  to  the  sequential  nature  of  single 
processors,  or  communication  delays  in  multiple 
processor  architectures,  there  is  always  an 
indeterminacy  in  the  detection  of  events.  Only  a 
partial  ordering  of  events  is  available  in  a 
computer.  Therefore,  his  definition  of  finite  state 
automata  is  extended  to  incorporate  a  relation 
describing  near-simultaneity.  However,  the 
specific  definition  of  the  @T  construction  is  not 
addressed. 

[SCHO 90] also contributes to the theoretical
underpinning for the SCR method. Event classes are


defined  as  instants  in  time  when  a  predicate  P  on 
an  environmental  state  function  s  and  time  t  is 
true  [P(s,t)].  A  mechanism  is  provided  for 
expressing  the  ideal  behavior  of  a  system,  with 
acceptable  deviations  (in  time  and  precision)  from 
the  ideal.  A  grammar  is  provided  for  the 
generation of "event class forms," including At-True
expressions. The following definition of
"@T(P1) when (P2)" is given:

$$EC\bigl(\exists \varepsilon\,[(\varepsilon > 0) \land \forall \delta\,[(0 < \delta \le \varepsilon) \rightarrow (\lnot P_1(s, t-\delta) \land P_2(s, t-\delta) \land P_1(s, t))]]\bigr)$$

where EC is the Event Class consisting of all times
t where the expression evaluates to true. In other
words, P1 must have been false, and P2 true for all
of some interval before P1 became true. Unlike
SCR, this definition does precisely state which
behavior of both variables constitutes an event.
However,  only  intervals  before  the  event  are 
supported.  In  other  areas,  this  definition  suffers 
from  the  same  lack  of  flexibility  as  the  SCR 
descriptor. 
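
To make the Event Class reading concrete, the following Python sketch evaluates it on sampled data. It is a discrete-time simplification and not the formulation of [SCHO 90]: time is a sample index, the interval ε is a fixed number of samples supplied by the caller (eps), and event_class is a name introduced here.

```python
def event_class(p1, p2, eps):
    """Return the sample indices t at which P1 is true, and P1 was false and
    P2 true for each of the eps samples immediately before t."""
    times = []
    for t in range(eps, len(p1)):
        before_ok = all((not p1[t - d]) and p2[t - d] for d in range(1, eps + 1))
        if p1[t] and before_ok:
            times.append(t)
    return times


# Hypothetical sampled predicates.
p1 = [False, False, True, True, False, True]
p2 = [True, True, True, False, False, True]
print(event_class(p1, p2, eps=2))
```

Only the first rise of P1 qualifies here; the second rise is rejected because P2 was false in the sample before it. As in the definition, nothing after the event is examined.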

4. Rationale for Choosing a Definition of @T

As  mentioned  in  the  introduction,  we  believe  that 
an  event  descriptor  should  achieve  the  objectives 
of generality, implementability, and verifiability.
We  now  explore  these  criteria  in  detail.  In  section 
2,  we  demonstrated  the  need  to  express  limitations 
on  events  concerning  the  sequencing  of  inputs  and 
bouncing or oscillating inputs. We contend that, to
fulfill  the  criterion  of  generality,  an  event 
descriptor  should  be  able  to  specify  input 
sequencing  whenever  it  is  desired.  There  should 
also  be  provision  for  useful  specifications  dealing 
with  the  ill-behaved  inputs  presented  in  section 
2.2. 

Regardless  of  the  intuitive  and  theoretical 
definitions  of  the  At-True  notation,  if  it  is  to  be 
used  in  the  specification  of  software,  then  its  value 
lies  in  the  possibility  of  implementation  in  code. 
Thus, the definition of the notation should be
appropriate to the goal of producing code, not to any
abstract  sense  of  neatness  or  symmetry.  It  should 
completely  describe  the  circumstances  under 
which  the  code  must,  may,  and  may  not  detect  an 
event.  Furthermore,  the  code  implied  by  the  use  of 
the  notation  should  not  be  any  more  complex  than 
the  problem  requires.  In  most  cases,  even  in  hard 
real  time  systems,  only  a  minimum  polling  rate  for 
the inputs needs to be specified, and exact timing
and sequencing of the polling is not a specification
issue. Yet, in those instances where the relevant
external variables are not independent, or the
inputs are ill-behaved, the specifier should be able to
express more exact requirements. Some of the
additional complexity involved in epsilon intervals
can be justified in the case of hard real time
systems, where temporal precision is crucial.
However, the "hard real time" qualifier means only
that the required functions must always be
completed in the specified time. It does not
necessarily imply that the drawing of distinctions
between event ordering need be any more or less
precise than in non-hard real time systems.

In  accordance  with  this  reasoning,  we  maintain 
that  an  event  descriptor  fulfills  the  requirement  of 
implementability if it completely defines the
circumstances  when  events  will  be  detected.  In  its 
least  complex  form,  the  descriptor  should  be 
implemented  by  code  that  performs  no  more  than  a 
straightforward  check  of  the  input  values.  It 
should  also  be  practical  to  implement  the  detection 
of  more  complex  event  descriptions.  Further,  the 
definition  must  be  applicable  for  cases  in  which 
any  atomic  predicate  of  the  At-True  notation  is 
either  polled  or  interrupt  driven.  In  this  paper, 
"interrupt  driven"  refers  only  to  those  input  data 
items  that  result  in  immediate  suspension  of  code 
execution  and  branching  to  separate  code  (possibly 
pre-empted  by  higher  priority  interrupts).  Note 
that  this  definition  of  "interrupt  driven"  does  not 
include data items, such as buffered keyboard
inputs, that cause system-level interrupts without
changing  the  order  of  execution  of  statements  in 


the  software  being  specified.  At  the  level  of  the 
software  being  specified,  the  data  item  is 
interrupt-driven  only  if  the  software  itself  "hooks" 
the  interrupt. 

Finally, we propose the goal of verifiability. Once
the code is written to implement an event
descriptor, it should be possible to verify the
correctness of that implementation by identifying
the  code  elements  that  fulfill  the  functional  and 
temporal  requirements  of  the  definition.  We  shall 
provide  an  example  of  such  a  verification  at  the 
end  of  the  next  section. 

To depict the fulfillment of these criteria, we shall
use  the  tabular  form  as  shown  at  the  bottom  of  the 
page. 

As  previously  noted,  the  original  SCR  descriptor 
definition  provides  for  all  of  the  variable-checking 
sequences  except  "both."  The  Event  Class 
definition  allows  only  the  "before"  sequence. 
Neither of these definitions provides the means to
handle ill-behaved inputs or time delays before
and after the event. The original definition is
incomplete, as it does not address when the event
variable  itself  should  be  checked.  Event  Classes 
correct  this  omission.  Both  of  these  definitions 
allow  simple  code  to  be  generated  to  detect  events. 
Since  they  do  not  address  complex  input  behavior 
and  polling  and  interrupts,  they  are  not  rated  in 
these  areas.  Due  to  its  vague  description  of  event 
detection,  and  its  relegation  of  timing  constraints 
to  a  separate  chapter,  the  SCR  descriptor  does  not 
fulfill  the  verification  criterion.  It  is  unclear 
whether  Event  Classes  would  improve  upon  this. 

5. The Extended Event Descriptor

5.1 Extensions


The  requirement  of  generality  will  be  provided  for 
by  a  series  of  extensions  of  the  @T  notation.  By 
subscripting  the  @T  and  WHEN  symbols  with 
numbers  and  symbols  representing  temporal 
limits, the exact nature of the timing constraints
can be specified. The most elementary form of our
new descriptor consists of the basic SCR format
with a time subscript added to the event: @T_x(A)
WHEN (B). This form requires that @T(A) be
detected if A remains false for at least x units of time
and then remains true for at least x units of time.
The value of B may be checked at any time within
x units of time before or after the time when A
changes value. Assuming that both values are
polled, any scheme that polls A at intervals of no
more than x would be acceptable. B may either be
polled regularly at intervals of no more than x, or
it may be polled immediately after @T(A) is
detected.  This  allows  the  programmer  maximum 
flexibility  in  implementing  the  specification.  This 
form  of  the  descriptor  is  equivalent  in  definition  to 
the  event  classes  of  [SCHO  90],  with  the  epsilon 
interval  applying  to  both  sides  of  the  event. 


To  allow  for  the  "bouncing"  of  inputs,  we  provide 
our  next  extension  to  the  notation.  The  subscript 
after  the  @T  portion  may  consist  of  two  comma 
separated  elements,  separately  describing  the 
times  that  the  expression  must  remain  false,  and 
then true. For example, @T_{x,y}(A) WHEN (B) would
indicate that the event must be detected whenever
A  is  false  for  at  least  x  time  and  then  true  for  at 
least  y  time.  The  bouncing  mechanical  switch  can 
be  accommodated  by  this  notation,  where  x  and  y 
may  be  set  to  the  minimum  amount  of  time  that 
the  switch  will  remain  in  each  state  before 
changing.  So  long  as  the  "settling"  time  for  the 
switch's  bouncing  is  significantly  less  than  x  and  y, 
this  will  ensure  that  only  one  event  is  generated 
for  each  movement  of  the  switch. 
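
A detector in the spirit of the two-element subscript can be sketched as follows. This Python sketch makes several assumptions that are not part of the notation: time is measured in samples, x and y are sample counts, and a completed false run of length x stays "armed" through short bounces until a true run of length y is seen. The name debounced_events is introduced here for illustration.

```python
def debounced_events(samples, x, y):
    """Return the sample indices at which A, having been false for at least x
    consecutive samples, has then been true for at least y consecutive samples."""
    events = []
    false_run = 0   # length of the current run of false samples
    true_run = 0    # length of the current run of true samples
    armed = False   # set once a false run of length >= x has been observed
    for i, a in enumerate(samples):
        if a:
            true_run += 1
            false_run = 0
            if armed and true_run >= y:
                events.append(i)
                armed = False   # one event per movement of the switch
        else:
            false_run += 1
            true_run = 0
            if false_run >= x:
                armed = True
    return events


F, T = False, True
samples = [F, F, F, T, F, T, F, T, T, T, T]   # hypothetical bouncing switch
print(debounced_events(samples, x=3, y=3))
```

With these samples a naive false-to-true detector would fire three times, while the sketch reports a single event once the input has settled.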


When  the  inputs  oscillate,  as  in  cases  2  and  3  of 
Figure  1,  the  @T  notation  must  be  further 
extended  by  providing  a  pre-condition  for  the 
event. @T_{x1,x2}(A' => A) WHEN (B) indicates that
the  event  must  be  detected  whenever  the 
precondition  A '  is  true  for  at  least  xl  time,  and 
then  A  is  true  for  at  least  x2  time.  For  the  example 
of  Case  2  of  figure  1,  if  we  wanted  to  specify  an 


event  in  which  temperature  increases  past  98.7 
degrees,  we  could  ensure  that  spurious  events 
would not be detected by specifying @T_{2,1}([Temp ≤
98.4] => [Temp > 98.7]). This event must be detected
any  time  the  temperature  stays  at  or  below  98.4 
for  at  least  2  units  of  time  and  subsequently 
increases  past  98.7  for  at  least  1  unit  of  time.  It 
must  not  be  triggered  until  after  the  temperature 
is  found  to  be  less  than  or  equal  to  98.4  for  at  least 
one  sample.  Similarly,  the  requirement  of  Case  3 
may be satisfied by @T_{1,1}([Temp ≤ 98.5] => [Temp ≥
98.6]).
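
The precondition form behaves like a hysteresis band, which the following Python sketch illustrates for the Case 2 example. The readings are hypothetical, x1 and x2 are expressed as sample counts, and precondition_events is a name introduced here; it is a sketch under those assumptions rather than a definitive implementation.

```python
def precondition_events(readings, pre, post, x1, x2):
    """Report an event when `pre` has held for at least x1 consecutive samples
    and, later, `post` holds for at least x2 consecutive samples."""
    events = []
    pre_run = 0     # consecutive samples satisfying the precondition A'
    post_run = 0    # consecutive samples satisfying the event condition A
    armed = False   # True once the precondition has been satisfied
    for i, value in enumerate(readings):
        pre_run = pre_run + 1 if pre(value) else 0
        post_run = post_run + 1 if post(value) else 0
        if pre_run >= x1:
            armed = True
        if armed and post_run >= x2:
            events.append(i)
            armed = False   # the precondition must be met again before the next event
    return events


# Hypothetical noisy readings that rise past 98.7 and then oscillate around it.
readings = [98.3, 98.4, 98.6, 98.8, 98.6, 98.8, 98.9, 98.7, 98.8]
events = precondition_events(
    readings,
    pre=lambda temp: temp <= 98.4,
    post=lambda temp: temp > 98.7,
    x1=2,
    x2=1,
)
print(events)
```

The later crossings of 98.7 are ignored because the reading never returns to 98.4 or below, so the precondition is not re-established.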

The next extension for the notation allows the
specification of the sampling order of the WHEN
clause. A subscript after the WHEN clause can
contain a relational sign (< or >) and, optionally, a
number representing a time delay. In understanding
this use of the relational signs, keep in mind
that they are used to delineate the sampling order
of the clauses. Thus, @T_x(A) WHEN_<(B) requires
that B be sampled after the event @T(A) has
occurred. @T_x(A) WHEN_{<y}(B) indicates that B
should  be  sampled  at  least  y  units  of  time  after 
@T(A).  In  both  cases,  B  should  be  sampled  no  later 
than  x  units  of  time  after  the  event.  Multiple 
WHEN  clauses  may  be  used  if  different  sampling 
orders  are  required  for  different  conditions. 

5.2.  Polling  and  Interrupts 

This  completes  our  extension  of  the  event 
descriptor.  A  formal  definition,  along  with  coding 
examples,  follows.  First,  however,  another 
dimension  of  the  sampling  issue  must  be  addressed 
to  fulfill  the  goal  of  implementability — that  of  the 
interaction  between  input  type  (polled  or 
interrupt)  and  sampling  sequence.  To  illustrate 
this  interaction,  we  will  investigate  a  series  of 
examples based upon an event specified by @T_x(A)
WHEN_>(B) WHEN_<(C). This specification dictates
that  B  must  be  sampled  before  the  event,  and  C 
after. 

If A, B, and C are polled variables, this specification
is most easily implemented by conducting a
poll of all three variables every x units of time,
using the order B-A-C. The event @T_x(A) is
detected whenever successive polls of A result in
values of false and then true. Since the exact
timing of the external event cannot be determined
within the interval between the two polls of A, the
value used for B must be checked before the first
poll of A (and retained for a full polling cycle), and
C must be checked after the last poll of A. If A is
polled, but B and C are interrupt driven, the same
effect is achieved by storing the value of B for one
polling cycle, and sampling C after polling A.

The fully polled scheme is illustrated in Figure 2. As time elapses
(from left to right), the software conducts
regular polls of B, A, and C, in that order. The
successive values of false and true for A indicate
that the event @T(A) occurred sometime within
the shaded zone. Therefore, the values of B and
C in the shaded zone, whether true or false
(indicated in the figure by "T/F"), are not useful;
the circled values must be used. Although there
is a time delay inherent in the software's access
to external data, in this example that delay is
relatively constant for all data.

If A is interrupt driven, as illustrated in Figure
3, the timing of the event @T(A) will be better
known. Where B and C are polled values, the
last poll of B prior to the event will be used, and
C will be polled as part of the interrupt routine.
If B is interrupt driven, A must have a higher
interrupt priority, to prevent changes to the
previous value of B during A's interrupt handling
routine. Conversely, C must have a higher
interrupt priority than A, since A's interrupt-handling
routine must be able to receive an
updated value of C.
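
The fully polled B-A-C scheme described above can be sketched in Python as follows. The sketch is illustrative only: the poll functions are supplied by the caller, the initial values (A true, B false) are an assumption chosen so that the first cycle cannot report an event, and make_detector is a name introduced here.

```python
def make_detector(poll_a, poll_b, poll_c):
    """Build a polling-cycle function for an event where B is sampled before
    A and C is sampled after A, using the poll order B-A-C."""
    prev_a = True    # assumed initial value of A
    prev_b = False   # assumed initial value of B

    def poll_cycle():
        nonlocal prev_a, prev_b
        old_a, old_b = prev_a, prev_b
        prev_b = poll_b()   # B first; this value is retained for the next cycle
        prev_a = poll_a()   # then A
        c = poll_c()        # C last, after the poll of A
        # Event: A went false -> true, B was true on the poll before A's
        # previous (false) poll, and C is true after A's latest poll.
        return (not old_a) and prev_a and old_b and c

    return poll_cycle


# Hypothetical inputs, one value per polling cycle.
a_values = iter([False, False, True, True])
b_values = iter([True, True, False, False])
c_values = iter([False, False, True, True])
detector = make_detector(lambda: next(a_values),
                         lambda: next(b_values),
                         lambda: next(c_values))
results = [detector() for _ in range(4)]
print(results)
```

In the third cycle the event is reported even though B has already gone false, because the value of B retained from the previous cycle is the one that precedes the interval in which A changed.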

The above reasoning is all that is needed in cases
where  the  required  intervals  between  the  event 
and  the  sampling  of  B  and  C  are  less  than  a 
processor  instruction  cycle.  Where  that  is  not  the 
case,  either  due  to  temporal  indeterminacy  or 
external  relations  between  the  conditions,  delays 
must be inserted into the sequence of variable
value checks, using numeric values with the
WHEN expressions. Figure 4 illustrates the
specification @T_x(A) WHEN_{>y}(B) WHEN_{<z}(C).

In  Figure  4  the  time  intervals,  representing 
required  delays  and/or  temporal  uncertainties, 
have  further  spread  the  time  interval  over  which 
the real-world event may be considered to have
occurred,  relative  to  the  B  and  C  conditions.  The 
polling  of  B  and  C  must  be  separated  from  the 
polling  of  A  by  at  least  the  amount  of  the  delay 
values. 


6. Definition and Implementation

We have asserted that the event notation should
be defined in terms of the code that it requires.
Therefore, for a given event expressed in our
notation, we define when the code must, may, and
must not detect an event, based upon the
information available to the software system.
Predicates will be expressed with a time parameter;
P(t) is true iff a value polled (or read from
memory,  if  interrupt  driven)  at  time  t  makes  P 
true.  This  definition  is  described  in  terms  of  time 
intervals,  delineated  by  nine  points  in  time,  as 
illustrated  in  Figure  5. 

Definition

The fully extended event notation @T_{x1,x2}(A' => A)
WHEN_{>y}(B) WHEN_{<z}(C), where

$$x_1 > 0 \land x_2 > 0 \land y \ge 0 \land z \ge 0 \land y < x_2 \land z < x_2 \land [(A' \land A) \rightarrow \mathit{false}],$$

is defined as follows:

Part 1

The code must detect an event exactly once for any
time interval (t7,t9) where:

$$\exists (t_1, t_2, t_3, t_4, t_5, t_6, t_8) \text{ such that } t_1 + x_1 \le t_2 \land t_2 \le t_6 \land t_3 + x_1 \le t_5 \land t_4 + x_2 \le t_6 \land t_5 + y \le t_6 \land t_6 + z \le t_7 \land t_6 + x_2 \le t_8 \land t_7 + x_2 \le t_9$$

and

$\forall t\,[(t_1 \le t < t_2) \rightarrow A'(t)]$ and
{precondition satisfied}

$\forall t\,[(t_2 \le t < t_6) \rightarrow \lnot A(t)]$ and
{A is false after precondition but before t6}

$\forall t\,[(t_4 \le t < t_6) \rightarrow \lnot A(t)]$ and
{A is false for at least interval x2}

$\forall t\,[(t_3 \le t < t_5) \rightarrow B(t)]$ and
{B is true for at least interval x1}

$\forall t\,[(t_5 \le t < t_6) \rightarrow \lnot A(t)]$ and
{A is false during interval y before t6}

$\forall t\,[(t_6 \le t < t_8) \rightarrow A(t)]$ and
{A is true for at least interval x2 after t6}

$\forall t\,[(t_7 < t \le t_9) \rightarrow C(t)]$.
{C is true for at least interval x2}

Part 2

The code may detect an event during any interval
(t7,t9) where

$\exists t\,[(t_1 \le t < t_2) \land A'(t)]$ and
{precondition satisfied at least momentarily}

$\forall t', t''\,\{[t_1 \le t' < t'' \le t_2 \land t'' - t' \ge x_2] \rightarrow \exists t\,[(t' \le t < t'') \land A'(t)]\}$ and
{precondition satisfied at least once during
every interval of length x2 between t1 and t2}

$\exists t\,[(t_1 \le t < t_6) \land \lnot A(t)]$ and
{A is false at least momentarily before t6}

$\exists t\,[(t_3 \le t < t_5) \land B(t)]$ and
{B is true at least momentarily}

$\exists t\,[(t_6 \le t < t_8) \land A(t)]$ and
{A is true at least momentarily during interval
x2 after t6}

$\exists t\,[(t_7 < t \le t_9) \land C(t)]$.
{C is true at least momentarily during
interval x2}


Part 1 of the definition ensures that each
condition will be true throughout the appropriate
time interval. So long as the criteria depicted in
the figure and in Part 1 occur, the code must
detect the event once. That detection can occur no
earlier than time t7 and should be no later than
shortly after t9. Part 2 ensures that each
condition is true at least once during each interval
during which the code should check the value of
that condition. If so, the code may detect the event.
This is equivalent to the circumstances in which
the original SCR descriptor was "undefined." If
neither Part 1 nor Part 2 is true, the code must
not detect the event. Note that, if the precondition
interval (x1) is longer than the basic polling
interval (x2), then Part 2 allows detection of the
event only if the precondition is true at least once
during every interval of length x2 between t1 and
t2. This has the effect of requiring that the code
check the precondition as often as it checks the
other conditions. Note that the time interval from
t1 to t2 must occur before t6, but no ordering is
required between these times and t3 and t4. In
other words, the precondition is satisfied if its
expression is true for an interval of x1 at any time
prior to A becoming true.

The following example is a Pascal program that
implements the expression @T_{x1,x2}(A' => A)
WHEN_{>y}(B) WHEN_{<z}(C), using polling of A, B,
and C. For reasons of brevity, this code is based
upon the assumptions that y+z < x2 and that x1 ≥
x2. If these assumptions were not justified, the
code would become more complex. An extension of
this code, relaxing the assumption to allow x1 <
x2, is included as an appendix, in Section 11 of this
paper.

001 program RealTimeProg;
002 var
003   A, APrime, B, C : Boolean;
004   TimeNow, NextTime, X1, X2, Y, Z : TimeType;
005 const
006   SmallTime : TimeType = {SmallValue}; {See explanation below}
007
008 {functions Time, Initialize, Delay, PollA, PollB, PollC, and PollAPrime
009  and procedure ReportEventDetected not shown}
010
011 function AtTrueABC : Boolean; {Returns true if event occurs}
012 var
013   PrevA, PrevB : Boolean;
014 begin
015   PrevA := A; {Save previous values of A & B}
016   PrevB := B;
017   PollB(B); {Poll A, B, C with delays of Y and Z}
018   Delay(Y);
019   if not APrime then PollAPrime(APrime); {If APrime true, don't poll}
020   PollA(A);
021   Delay(Z);
022   PollC(C);
023   AtTrueABC := APrime and not PrevA and A and PrevB and C;
024   if A then APrime := false; {Reset pre-condition}
025 end;
026
027 begin {Main body calls AtTrueABC at intervals of (X2-SmallTime)}
028   Initialize(X1, X2, Y, Z); {gets timing data}
029   APrime := false; {set initial values}
030   A := true;
031   B := false;
032   NextTime := Time + X2 - SmallTime;
033   repeat
034     if AtTrueABC then ReportEventDetected;
035     repeat TimeNow := Time until TimeNow >= NextTime;
036     NextTime := NextTime + X2 - SmallTime;
037   until false;
038 end.
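
The polling logic of the program can be exercised without real-time hardware by replacing the delays with simulated time. The Python sketch below is a transcription made for that purpose and is not part of the original program: the class name PolledDetector, the signal functions, and the timing values (x2 = 10, SmallTime = 1, y = z = 2) are all hypothetical, and the comments indicate which numbered Pascal line each statement mirrors.

```python
class PolledDetector:
    """Mirror of the Pascal function AtTrueABC, driven by simulated time."""

    def __init__(self, a_prime_at, a_at, b_at, c_at, y, z):
        # Signal sources are functions of simulated time (assumed, for testing only).
        self.a_prime_at, self.a_at = a_prime_at, a_at
        self.b_at, self.c_at = b_at, c_at
        self.y, self.z = y, z
        # Initial values as in lines 029-031.
        self.a_prime, self.a, self.b = False, True, False

    def step(self, t):
        """One call of AtTrueABC beginning at simulated time t."""
        prev_a, prev_b = self.a, self.b        # lines 015-016
        self.b = self.b_at(t)                  # line 017
        t += self.y                            # line 018: Delay(Y)
        if not self.a_prime:                   # line 019
            self.a_prime = self.a_prime_at(t)
        self.a = self.a_at(t)                  # line 020
        t += self.z                            # line 021: Delay(Z)
        c = self.c_at(t)                       # line 022
        detected = (self.a_prime and not prev_a and self.a
                    and prev_b and c)          # line 023
        if self.a:                             # line 024
            self.a_prime = False
        return detected


X2, SMALL_TIME, Y, Z = 10, 1, 2, 2
detector = PolledDetector(
    a_prime_at=lambda t: t < 20,   # precondition holds early on
    a_at=lambda t: t >= 40,        # A becomes true at t = 40
    b_at=lambda t: t < 39,         # B is true until just before the event
    c_at=lambda t: t >= 41,        # C becomes true shortly after the event
    y=Y, z=Z,
)
period = X2 - SMALL_TIME           # interval between calls, as in the main body
results = [detector.step(n * period) for n in range(7)]
print(results)
```

In this scenario the event is reported exactly once, on the first call whose poll of A follows the change at t = 40, and the reset of the precondition prevents a second report on the next call.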


To  fulfill  the  definition,  the  choice  of  the  value  of 
constant  SmallTime  in  the  code  is  crucial.  Even  if 
we  could  be  assured  that  A,  B,  and  C  would  be 
polled  at  intervals  of  exactly  x2,  the  requirements 
for  intervals  y  and  z  could  not  be  fulfilled.  This  is 
because the value of A might have changed from
false  to  true  at  any  time  between  two  successive 
polls.  Thus,  we  must  add  delays  of  at  least  y  and  z 
when  polling  B  and  C,  respectively.  Yet  the  values 
of  B  and  C  are  not  required  to  remain  valid  for  any 
longer  than  y  and  z.  That  would  mean  the  delay 
timing  for  y  and  z  would  have  to  be  perfect,  which 
is  not  achievable  with  software.  Therefore, 
SmallTime  is  used  to  increase  the  polling  rate 
enough  to  ensure  that  B  and  C  are  polled 
sufficiently close to the time that A changes from
false  to  true.  SmallTime  must  be  greater  than  the 
maximum  amount  of  error  that  may  occur  in  the 


timing  of  the  polls.  Specifically,  SmallTime  must 
be  greater  than  the  maximum  delay  between  the 
polling  of  B  and  A,  or  A  and  C,  in  excess  of  y  and  z, 
plus  any  variability  between  successive  calls  of 
AtTrueABC. If a value of SmallTime that is much
smaller  than  the  value  of  x2  cannot  be  found,  then 
the  specifications  are  not  feasible. 

In order to validate the Pascal program, we show that:

*  the  fulfillment  of  PART  1  of  the  definition 
implies  that  the  code  will  detect  the  event 

*  the  detection  of  an  event  by  the  code  implies 
the  fulfillment  of  PART  2  of  the  definition. 

Informal Proof

Part 1 => Event Detected Once

The code will detect an event whenever the
conjunction APrime and not PrevA and A and
PrevB and C in line 023 evaluates to true.
Therefore, we show that the truth of each element
of the conjunction in line 023 follows from the
definition. Since AtTrueABC (line 011) is called
every x2 - SmallTime units of time, there exists
exactly one time T: t6 ≤ T < t6+x2 - SmallTime
when line 020 is executed.

A: A is polled at time T. Since t6 ≤ T < t6+x2 -
SmallTime < t8, A is true at time T.

C: C is polled after T, and after the delay of line
021. Let this be time c. The delay will be at
least z, and less than z + SmallTime. Since
SmallTime < x2, we have t7 = t6 + z ≤ T + z ≤ c
< T + z + SmallTime < t6+x2 + z = t9.
Therefore, C is true at time c.

APrime: According to the definition, A' is true for
an interval of at least x1, at some time prior to
A becoming true. By the program assumption,
x1 ≥ x2, so there must be a time p when line
019 is executed such that t1 + SmallTime ≤ p
≤ t2, making APrime true. APrime would then
remain true until A is polled and found to be
true, and line 024 is executed. This will, of
course, happen shortly after time T, but only
after the execution of line 023, which causes
the report of the event. Let pA be the time of
the last poll of A prior to time T. By the
definition of T above, pA < t6. Since SmallTime < y,
pA > t3. Therefore, A is false at time pA. Any
other polls of A, between times p and pA,
happen after time t1 + SmallTime and before
t6. Therefore, all such polls of A are false, and
APrime will be true when line 023 is evaluated
after time T.

not PrevA: The reasoning is the same as above
for time pA.

PrevB: This is the value of B obtained in the
previous call to AtTrueABC. Let the time of
that poll be b. By the specification of
SmallTime, we are assured that (T - x2 - y) < b
≤ (T - x2 + SmallTime - y). Since t6 ≤ T <
t6+x2 - SmallTime, we have t3 < b ≤ t5.

Given the above valuations, line 023 will evaluate
to true, causing the event to be detected upon the
return from the function AtTrueABC. Since
APrime is reset to false prior to that detection,
there will be only one detection of the event until
the appropriate conditions are fulfilled once again.

Event Detected => Part 2

Again, let T be the time that line 020 executes,
immediately prior to the detection of the event.
Let:

t6 = T;
t7 = t6 + z;
t8 = t6 + x2;
t9 = t6 + x2 + z;
t5 = t6 - y;
t4 = t6 - x2;
t3 = t6 - y - x1;
t2 = t6;
t1 = Min[t2 - x1, (the time when APrime was
set to true) - SmallTime].

These values satisfy the inequalities on t1-t9 from
the definition. Now we produce the times referred
to in Part 2, based upon the polls of the values that
satisfied the conjunction in line 023:

$\exists t\,[(t_1 \le t < t_2) \land A'(t)]$ and
[Let t = (the time when APrime was set to
true by line 019)]

$\exists t\,[(t_1 \le t < t_6) \land \lnot A(t)]$ and
[Let t = (the time of the previous execution
of line 020 prior to t6)]

$\exists t\,[(t_3 \le t < t_5) \land B(t)]$ and
[Let t = (the time of the previous execution
of line 017 prior to t6)]

$\exists t\,[(t_6 \le t < t_8) \land A(t)]$ and
[Let t = T]

$\exists t\,[(t_7 < t \le t_9) \land C(t)]$.
[Let t = (the time of the execution of line
022 after t6)]

This satisfies both parts of the definition; thus, the
program implements the definition.

7.  Other  Implementation  Issues 

The  example  program  has  implemented  the  fully 
extended  form  of  the  event  descriptor  for  polled 
variables.  If  A  were  interrupt  driven,  the  code  and 
validation would be more straightforward. The
only  added  complexity  would  be  that,  when  the 
value  of  B  changes,  the  old  value  and  the  time  of 
its  change  must  be  saved.  Then,  in  evaluating 
whether  the  event  has  occurred,  the  old  value  of  B 
must  be  used  if  time  y  has  not  elapsed  since  the 
change. 
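
The bookkeeping for B in the interrupt-driven case can be sketched as follows. This Python sketch is an illustration under stated assumptions: the class name BHistory and its methods are hypothetical, times are plain numbers supplied by the caller, and the interrupt mechanism itself is not modeled.

```python
class BHistory:
    """Keep the current value of B, its previous value, and the time of the
    last change, so that an event handler for A can pick the right one."""

    def __init__(self, initial):
        self.value = initial
        self.old_value = initial
        self.changed_at = float("-inf")   # no change has been observed yet

    def update(self, new_value, now):
        """Record a new sample of B taken at time `now`."""
        if new_value != self.value:
            self.old_value = self.value
            self.value = new_value
            self.changed_at = now

    def value_for_event(self, event_time, y):
        """Value of B to use for an event detected at event_time: the old
        value if less than y has elapsed since B changed, else the current one."""
        if event_time - self.changed_at < y:
            return self.old_value
        return self.value


b = BHistory(initial=True)
b.update(False, now=9.5)                 # B changes shortly before the event
print(b.value_for_event(10.0, y=1.0))    # change is too recent: old value is used
print(b.value_for_event(11.0, y=1.0))    # y has elapsed: current value is used
```

Keeping only one previous value is sufficient here on the assumption that B changes at most once within any interval of length y.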

Since the extended event descriptor is intended to
replace the SCR descriptor and definition, we show
that it can still express the simpler form of the
SCR notation. This is accomplished by noting that
[@T_{x,x}(¬A => A) WHEN_{>0}(B)] ∨ [@T_{x,x}(¬A => A)
WHEN_{<0}(B)], under our definition, reduces to
@T(A) WHEN(B), with epsilon values of ≤ x, under
the SCR definition.

The  above  example  assumes  that  all  values  are 
polled.  To  demonstrate  the  impact  of  polling  versus 
interrupt  schemes,  we  provide  the  following 
examples  of  code  fragments  that  implement  the 
somewhat simplified expression @T_x(A) WHEN_>(B)
WHEN_<(C). The reader may wish to refer back
to Figures 2, 3, and 4 for illustrations of the timing
schemes. 

a. "A" polled. The code must alternately poll B, A,
and C, waiting for the sequence:

B is True,          A is False,         [C is any value.]
[B is any value.]   A is True,          C is True.

An alternative approach would allow polling of B
until it becomes true, and then beginning the alternate
polling of A and B. C then need only be
polled after A and B have met the criteria.
Particularly where B is interrupt driven, this
would mean that no polling at all need be done
until interrupt code is called.

The following is a sample PASCAL code fragment
which  implements  the  first-mentioned  approach. 
External  code  would  be  required  to  execute  this 
fragment  at  least  every  x  units  of  time. 

{A is initially True; B is initially False}

OldA := A;
OldB := B;
PollB(B);
PollA(A);
PollC(C);
if A and OldB and C and not OldA then
PerformAction; 
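
The same fragment can be exercised off-line by replacing the polling procedures with scripted values. The Python sketch below is an illustration under assumptions: make_detector and the three poll callables are hypothetical names, and each call to step() stands for one execution of the fragment.

```python
def make_detector(poll_a, poll_b, poll_c, a_init=True, b_init=False):
    """Return a step() function that mirrors the polled fragment.

    poll_a, poll_b and poll_c are hypothetical zero-argument callables
    returning the current Boolean value of each input.
    """
    state = {"A": a_init, "B": b_init}

    def step():
        old_a, old_b = state["A"], state["B"]   # values from the previous pass
        state["B"] = poll_b()
        state["A"] = poll_a()
        c = poll_c()
        # Event: A rose since the last pass, B was true beforehand, C is true.
        return state["A"] and old_b and c and not old_a

    return step


# Scripted input values, one entry per pass of the polling loop.
a_values = iter([False, False, True, True])
b_values = iter([False, True, True, True])
c_values = iter([False, False, True, True])

step = make_detector(lambda: next(a_values),
                     lambda: next(b_values),
                     lambda: next(c_values))
results = [step() for _ in range(4)]
```

With this script the event is reported on the third pass only, when A has just risen, B was true on the preceding pass, and C is true.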


If optimization to minimize unnecessary polling is
desired, the code complexity can be increased to
provide "short circuit" evaluation. It should be
noted that, if the language provides short-circuit
evaluation of Boolean expressions,
function calls could be used for polling, achieving
similar results. However, this would make tracing
and verification of polling sequences more
complex, and should probably not be used. The
following code implements the "short circuit"
evaluation.


{B is initially False}

OldB := B;
PollB(B);
if OldB then
begin
  OldA := A;
  PollA(A);
  if A and not OldA then
  begin
    PollC(C);
    if C then PerformAction;
  end
end
else
begin
  PollB(B);
  if B then PollA(A);
end 
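
To see how much polling the short-circuit form avoids, the Python sketch below counts the polls it performs. It is a sketch under assumptions rather than a transcription: the class name and poll callables are hypothetical, and B is polled once per pass here.

```python
class ShortCircuitDetector:
    """Polls A and C only when earlier conditions make the event possible."""

    def __init__(self, poll_a, poll_b, poll_c):
        # poll_* are hypothetical zero-argument callables returning a bool.
        self.poll_a, self.poll_b, self.poll_c = poll_a, poll_b, poll_c
        self.a = True      # A is initially True
        self.b = False     # B is initially False
        self.polls = {"A": 0, "B": 0, "C": 0}

    def _poll(self, name, fn):
        self.polls[name] += 1
        return fn()

    def step(self):
        old_b = self.b
        self.b = self._poll("B", self.poll_b)
        if not old_b:
            # B was false before this pass, so the event cannot fire yet.
            # Refresh A only if B has just become true, so that the next
            # pass has a valid previous value of A.
            if self.b:
                self.a = self._poll("A", self.poll_a)
            return False
        old_a = self.a
        self.a = self._poll("A", self.poll_a)
        if self.a and not old_a:
            return self._poll("C", self.poll_c)   # C is polled last
        return False


a_script = iter([False, True, True])
b_script = iter([True, True, True])
c_script = iter([True])
detector = ShortCircuitDetector(lambda: next(a_script),
                                lambda: next(b_script),
                                lambda: next(c_script))
results = [detector.step() for _ in range(3)]
```

In this run C is polled exactly once, on the pass where A rises while B was already true.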


b. "A" interrupt driven. When A becomes true, the
interrupt procedure must check the most recent
value of B, and then poll C. If both B and C are
true, the event is triggered.

Note: As interrupt procedures are not
defined in Standard PASCAL, we shall
use the rules of Borland Turbo
PASCAL [BORL 91]. The parameter
list will be a single type, rather than
Borland's enumerated list of CPU
registers. An interrupt suspends
execution of code anywhere in the
program (excepting higher priority
interrupt handlers) and causes
execution of the interrupt procedure.
Interrupt procedures have access to all
global variables.

The following code implements such a procedure.
Here, B is a global variable which is polled
externally, at least once every x units of time. A
and C might be global variables, if their values are
required elsewhere.


procedure Ainterrupt(Reg : RegistersType);
interrupt; {This procedure is triggered any time the monitored value of A changes}
var A, C : Boolean;

  procedure InterpretRegisters(Registers : RegistersType; var A : Boolean);
  begin
    ... {Determines correct value for A}
  end;

begin
  InterpretRegisters(Reg, A);
  if A and B then
  begin
    PollC(C);
    if C then PerformAction
  end
end; 
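
The control flow of this interrupt procedure can be mimicked in Python with an ordinary callback. This is only a sketch: Python has no interrupt procedures, and the names EventMonitor, on_b_polled and on_a_changed are assumptions introduced for illustration.

```python
class EventMonitor:
    """Interrupt-style handling of A with a polled B and an on-demand C."""

    def __init__(self, poll_c, perform_action):
        # poll_c and perform_action are hypothetical callables supplied by
        # the caller; B is refreshed elsewhere by a periodic poll.
        self.b = False
        self.poll_c = poll_c
        self.perform_action = perform_action

    def on_b_polled(self, value):
        """Called by the external polling loop at least every x time units."""
        self.b = value

    def on_a_changed(self, a):
        """Stand-in for the interrupt procedure: runs whenever A changes."""
        if a and self.b:            # A became true and B was last seen true
            if self.poll_c():       # C is polled only when A and B qualify
                self.perform_action()


fired = []
monitor = EventMonitor(poll_c=lambda: True,
                       perform_action=lambda: fired.append("event"))
monitor.on_a_changed(True)     # ignored: B has not been seen true yet
monitor.on_b_polled(True)
monitor.on_a_changed(False)    # ignored: A fell
monitor.on_a_changed(True)     # fires: A rose, B is true, C is true
```

Only the last change of A triggers the action, because it is the only one that finds the most recent value of B true.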

In all of the above examples, if B and/or C are
interrupt driven, their interrupt procedures will
merely update global variables. As stated in
Section 4, the interrupt priorities will be: C > A > B.

8.  Evaluating  the  Descriptor 

The  following  table  adds  our  descriptor  to  those 
covered  in  Section  4. 


The goal of generality is fully met by our proposed
extensions of the event descriptor, covering the
sequencing and input behavior issues. As
demonstrated in the simple coding examples, the
basic form of the new descriptor is implementable
with simple variable checking in a single conditional
statement. The more complex form is also
demonstrated without undue code complexity. We
also demonstrate the means to implement both
polled and interrupt-driven inputs. Finally, we
demonstrate a verification based on a code
walk-through type proof.

9.  Further  Work 

The work presented here has been produced
within the context of a larger work concerning the
tracing between Parnas (SCR) style specifications
and code, using the variable flavor annotations of
[HOWD 90]. During the implementation of "toy"
problems, it became apparent that previous
definitions of event descriptors were unsatisfactory,
as the descriptors did not allow precise
specification of certain useful but complex events.
Before we could trace, we required a working
definition of what code is required to actually
implement both simple and complex event
descriptors. The production of such a descriptor is
the function of this paper. Our next step will be to
identify what sort of annotations should be
included by the coder to document the
implementation.

University of Maryland.

11. Appendix: Extended Example Code

The example implementation of event-detecting
code shown in Section 6 requires only minor
extensions to allow for values of x1 that are
greater than x2. This situation would arise where
the truth of a precondition (A') needed to be
assured over a length of time greater than the

recognize that the precondition is satisfied
whenever A' is true for an interval of at least x1,
and that it will only detect the event if A' is true at
least once in every sub-interval of length x2 during
an interval of length x1. Thus, this code fulfills the
definition of @T_{x1,x2}(A' ⇒ A) WHEN_{>y}(B)
WHEN_{<z}(C).

program RealTimeProg;
var
  A, APrime, B, C : Boolean;
  TimeNow, NextTime, PollTime, APDuration, X1, X2, Y, Z : TimeType;
const
  SmallTime : TimeType = (SmallValue); {See explanation below}

{functions Time, and procedures Initialize, Delay, PollA, PollB,
 PollC, PollAPrime, and ReportEventDetected not shown}

function AtTrueABC : Boolean; {Returns true if event occurs}
var
  PrevA, PrevB, PrevAPrime : Boolean;
  PrevTime : TimeType; {Time of the previous poll of APrime}
begin
  PrevA := A; {Save previous values of A & B}
  PrevB := B;
  PollB(B); {Poll A, B, C with delays of Y and Z}
  Delay(Y);
  if APDuration <= X1 - X2 then {Check precondition only}
  begin                         {if not already satisfied}
    PrevAPrime := APrime;
    PrevTime := PollTime;
    PollAPrime(APrime);
    PollTime := Time;
    if APrime then
      APDuration := APDuration + PollTime - PrevTime;
  end;
  PollA(A);
  Delay(Z);
  PollC(C);
  AtTrueABC := (APDuration > X1 - X2) and
               not PrevA and A and PrevB and C;
  if A or not APrime then APDuration := 0; {Reset pre-condition}
end;

begin {Main body calls AtTrueABC at intervals of (X2-SmallTime)}
  Initialize(X1, X2, Y, Z); {gets timing data}
  APDuration := 0; {set initial values}
  APrime := false;
  PollTime := Time;
  A := true;
  B := false;
  NextTime := Time + X2 - SmallTime; {Determine polling interval}
  repeat {main loop: calls polling routine}
    if AtTrueABC then ReportEventDetected;
    repeat TimeNow := Time until TimeNow >= NextTime;
    NextTime := NextTime + X2 - SmallTime;
  until false;
end.
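
The handling of APDuration in the listing above can be isolated in a small Python sketch. The class and method names are hypothetical, times are plain numbers supplied by the caller, and the sketch covers only the precondition timer, not the polling of A, B and C.

```python
class PreconditionTimer:
    """Accumulates how long A' has been observed true across polls."""

    def __init__(self, x1, x2, start_time=0.0):
        self.threshold = x1 - x2      # required accumulated duration
        self.duration = 0.0           # plays the role of APDuration
        self.poll_time = start_time   # time of the most recent poll of A'
        self.a_prime = False

    def satisfied(self):
        return self.duration > self.threshold

    def poll(self, a_prime, now):
        """Record one poll of A' taken at time `now`."""
        if not self.satisfied():      # A' is no longer polled once satisfied
            prev_time, self.poll_time = self.poll_time, now
            self.a_prime = a_prime
            if a_prime:
                self.duration += now - prev_time

    def reset_if_needed(self, a):
        """Clear the accumulated time when A is true or A' is false."""
        if a or not self.a_prime:
            self.duration = 0.0


timer = PreconditionTimer(x1=5.0, x2=2.0)
for now in (1.5, 3.0, 4.5):           # A' seen true at each poll, A stays false
    timer.poll(True, now)
    timer.reset_if_needed(a=False)
ready = timer.satisfied()             # 4.5 accumulated against a threshold of 3.0
timer.poll(False, 6.0)                # no effect: precondition already satisfied
timer.reset_if_needed(a=True)         # A became true, so the timer is cleared
```

As in the listing, the accumulated duration is cleared once A becomes true, so a new occurrence of the event requires the precondition to be re-established.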


"Knowing where you've come from and where you are is essential to knowing how to
get where you want to go. Developing a new generation of products is a lot like taking a
journey into the wilderness. Who would dream of setting off without a map?" '

Steven  C.  Wheelwright  and  W.  Earl  Sasser,  Jr. 


INTRODUCTION 


This  paper  proposes  a  product  development  methodology  (PDM)  for  complex  systems 
evolving  within  the  current  economic  climate  of  the  United  States,  as  well  as  the  unstable  state  of 
world affairs envisioned throughout the decade. The PDM facilitates the development of systems
that are "multipurpose, flexible, highly mobile, and incorporate maximum bang for the buck."^

Ironically,  the  unstable  nature  of  the  development  environment  within  the  defense  community 
parallels the one encountered by the commercial sector over the last 20 years. Successful
companies  have  responded  by  adopting  a  product  development  methodology  that  adapts  to  ever 
changing  market  demands  and  the  concern  for  near  term  returns  (profit). 

The PDM exploits lessons learned from the commercial market analogy to establish a flexible,
low  risk,  cost  effective  approach  for  technological  progress.  The  approach  suits  systems 
development,  especially  those  involving  complex  mission  critical  computer  systems. 

Examination of the commercial product development process reveals a strategy that can
achieve the procurement flexibility needed by DoD. This strategy concentrates on leveraging the
state-of-the-art in a cost effective manner. The strategy also addresses risk management.

The  key  to  the  technological  success  of  this  strategy  relies  on  an  incremental  development 
process.  The  IBM  PC  serves  as  a  perfect  example.  The  i486  based  PC  resulted  from  successful 
sales  of  the  286, 386SX  and  i386  based  versions.  Incremental  upgrades  enabled  IBM  to  respond 
to  changes  in  market  demand  as  well  as  facilitate  the  transition  of  the  state-of-the-art.  In  this 
manner,  IBM  attained  strategic  flexibility. 

The  discussion  starts  at  a  basic  level  and  progresses  to  a  macroscopic  perspective.  This  paper 
contains  four  parts:  adaptation  of  a  commercial  approach,  incorporation  into  an  overall  risk 
management  scheme,  application  to  an  open  architecture  transition,  and  a  summary. 

The  paper  recommends  using  open  architectures  and  commercial  off-the-shelf  (COTS)  items 
to  implement  the  incremental  improvements  outlined  by  the  PDM.  The  summary  includes 
guidelines  for  successful  product  development,  as  well  as  ideas  for  future  work  in  this  area. 


ADAPTATION  OF  A  COMMERCIAL  APPROACH 

Decreasing  commercial  product  life  cycles  have  required  technologies  to  be  developed  at 
faster rates.^ As a result, companies have devoted more effort to the product development
process.  The  process  differs  from  company  to  company;  however,  the  high  tech  arena  focuses 
on  the  time  to  market.  Shortening  the  time  to  market  enables  a  company  to  increase  market  share, 
adapt product characteristics to market needs, enjoy high margins typically encountered in the
beginning of a product's life cycle, and shorten the payback period.^

This focal point requires a company to make decisions in what Preston Smith refers to as the
"fuzzy front end" of the product development process.^ Lack of quantitative information and
organizational structure characterize the ambiguity of the "fuzzy front end." Decisions made at
this time greatly affect the product's evolution. The expensive nature of changes made down the
road heightens the significance of these up front decisions. As a product progresses from
planning, through design, production, test and delivery, the cost to correct an error increases.^

Smith offers a simple decision analysis technique to attack problems in the "fuzzy front end."
The  technique  concentrates  on  the  interrelationships  between  time,  development  cost, 
performance  and  profit.  He  prefers  an  approach  based  on  estimates  generated  quickly.  Smith 
believes  complicated  estimation  tools  waste  time  and  lead  to  a  false  impression  of  the  accuracy  of 
the  available  data.  His  book  presents  several  examples  to  demonstrate  the  merit  of  his  approach. 
Therefore,  this  paper  concentrates  on  the  adaptation  of  Smith's  ideas  to  complex  systems. 

At  first  glance,  the  lack  of  the  profit  motive  within  the  government  appears  to  create  an 
obstacle to the application of Smith's approach to the development of complex systems.
However,  consider  the  savings  the  government  can  pursue  when  building  systems.  This 
viewpoint  establishes  the  profit  motive;  when  systems  cost  less,  profits  from  savings  follow.  In 
other  words,  life  cycle  cost  savings  create  a  profit.  The  analogy  between  profit  and  cost  savings 
permits  a  modification  to  Smith's  model  for  product  development  in  the  defense  sector.  Figure  1 
illustrates  the  product  development  framework  that  results  when  life  cycle  cost  replaces  profit  in 
Smith's  model.  The  arrows  indicate  relationships  between  the  areas  identified  in  the  circles. 


Note,  this  model  separates  development  costs  from  life  cycle  costs.  This  definition  differs 
from  the  definition  of  life  cycle  cost  used  in  Naval  acquisitions.  For  Naval  acquisitions,  life  cycle 
cost  is  the  sum  total  of  the  direct,  indirect,  recurring,  non-recurring,  and  other  related  costs 
incurred, or estimated to be incurred in the design, research and development (R&D), investment,
operation, maintenance, and support of a product over its life cycle."^

The  product  development  framework  fosters  decisions  based  on  time,  performance, 
development  cost  and  life  cycle  cost  tradeoffs.  To  achieve  savings,  the  product  development 
framework  must  assume  a  baseline  for  time,  performance  and  cost.  An  existing  system  functions 
as  the  baseline.  The  baseline  system  establishes  cost  and  performance  ceilings.  Measure  time, 
performance  and  cost  in  terms  of  the  incremental  contribution  to  the  baseline  system  when 
developing new products for existing systems.

This paper uses qualitative reasoning to demonstrate the utility of the product development
framework. A color coding scheme depicts the incremental contribution for each area. Ideally,
the color coding scheme would use a traffic light pattern. Picture red for an undesirable rating,
yellow  for  an  indeterminate  condition  and  green  for  a  favorable  estimate.  Figure  2  exhibits  the 
alternate  color  coding  scheme  used  in  this  paper  to  facilitate  duplication  of  the  material. 


Decreased  performance, 
increased  cost,  or 
increased  time  (red) . 

Indeterminate  (yellow) . 


Increased  performance, 
decreased  cost,  or 
decreased  time  (green) . 


Figure  2.  Color  Coding  Scheme 
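
The rating rule of Figure 2 can be written down directly. In this Python sketch the function name rate, the signed-change convention, and the treatment of a zero or unknown change as indeterminate are assumptions made for illustration.

```python
from enum import Enum


class Rating(Enum):
    RED = "red"        # decreased performance, increased cost, or increased time
    YELLOW = "yellow"  # indeterminate
    GREEN = "green"    # increased performance, decreased cost, or decreased time


def rate(change, higher_is_better):
    """Rate an estimated change relative to the baseline system.

    change is a signed estimate (positive means an increase), or None when
    no estimate can be made yet.
    """
    if change is None or change == 0:
        return Rating.YELLOW
    improved = change > 0 if higher_is_better else change < 0
    return Rating.GREEN if improved else Rating.RED


# Hypothetical estimates for one candidate, relative to the baseline.
candidate = {
    "performance": rate(+0.2, higher_is_better=True),
    "development cost": rate(+0.1, higher_is_better=False),
    "life cycle cost": rate(-0.3, higher_is_better=False),
    "time": rate(None, higher_is_better=False),
}
```

The four entries correspond to the four areas of the product development framework, each rated against the baseline system.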

A  quantitative  analysis  would  eventually  replace  qualitative  reasoning.  The  progression  of 
time  enables  the  analysis  to  improve  as  more  information  becomes  available  and  decisions  are 
revisited.  Hence,  the  quantitative  analysis  gets  fine  tuned  as  the  product  becomes  well  defined. 

This  product  development  framework  facilitates  the  assessment  of  life  cycle  costs, 
development costs, and development time for specific performance requirements. Continued
appraisal  will  result  in  a  set  of  performance  requirements  that  meets  cost  goals. 

One  problem  not  represented  directly  in  the  framework  is  the  difficulty  mustering  support  for 
the  acquisition  of  systems  on  the  basis  of  life  cycle  cost.  Opponents  can  attack  the  fidelity  of 
forecasts  beyond  a  5  year  period. 

However,  a  shorter  product  development  cycle  addresses  this  problem  by  trimming  the 
payback  period.  The  payback  period  is  the  time  it  takes  to  recoup  the  initial  development  cost 
through  life  cycle  cost  savings.  Reduced  payback  periods  strengthen  life  cycle  cost  estimates. 
The  incremental  product  development  approach  capitalizes  on  condensed  payback  periods. 
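
As a worked illustration of the payback period, the following Python sketch assumes constant annual savings and no discounting. The cost figures are hypothetical and are not drawn from any program.

```python
def payback_period_years(development_cost, annual_savings):
    """Years needed for life cycle cost savings to recoup the development cost."""
    if annual_savings <= 0:
        raise ValueError("no savings: the development cost is never recouped")
    return development_cost / annual_savings


# Hypothetical figures in arbitrary cost units, chosen only to show the arithmetic.
large_upgrade = payback_period_years(development_cost=12.0, annual_savings=2.0)
small_increment = payback_period_years(development_cost=3.0, annual_savings=1.5)
```

Under these assumed figures the large upgrade pays back in 6 years, beyond the 5 year horizon that opponents can challenge, while the small increment pays back in 2 years.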

The  framework  addresses  issues  on  a  discrete  product  basis.  Examples  of  discrete  products 
include:  disk  drives,  power  supplies,  and  stand  alone  computers.  In  contrast,  complex  computer 
systems  represent  an  amalgam  of  discrete  computer  products.  They  require  a  technique  that 
weighs  each  decision  on  a  macroscopic  level.  The  four  element  diagram  cannot  guide  complex 
decisions  without  a  higher  level  of  abstraction.  The  next  section  outlines  the  higher  level. 


RISK  MANAGEMENT  FOR  COMPLEX  COMPUTER  SYSTEMS 

Many  discrete  technical  approaches  compete  for  attention  in  the  "fuzzy  front  end"  of  complex 
computer  systems  development.  The  aforementioned  product  development  framework  expedites 
decisions  on  a  case  by  case  basis,  but  cannot  manage  a  complex  computer  system  in  its  entirety. 
A  useful  methodology  must  provide  a  map  for  macroscopic  considerations.  This  consideration 
differentiates  Smith's  commercial  technique  from  complex  systems  development.  Nevertheless, 
Smith's framework forms a foundation for the map used in the higher level of abstraction.

Basically,  the  methodology  sets  the  stage  for  each  discrete  approach  to  compete  in  terms  of 
time, cost and performance. A three step process establishes a clear path from the "fuzzy front
end" through product definition to a low risk development scenario. Figure 3 illustrates this
process for the development of a complex computer system on the basis of cost.


INCREASED PRODUCT DEFINITION


*  Note  -  Subjective  model  relative  to  existing  baseline. 


Figure  3.  Product  Development  Methodology 

The  first  step  identifies  each  discrete  candidate.  Identification  requires  at  least  a  subjective 
description  in  terms  of  time,  performance,  life  cycle  cost  and  development  cost.  In  fact, 
subjective  approaches  accelerate  the  process.  A  plethora  of  candidate  approaches  typically 
overwhelms  the  front  end  of  complex  computer  system  development.  A  detailed  quantitative 
analysis  of  each  would  consume  time  and  money. 

A  RAND  study  of  process  plants  demonstrates  the  lack  of  accuracy  of  data  in  the  "fuzzy  front 
end."  Process  plant  estimates  generated  on  the  basis  of  R&D  data  alone,  can  easily  overrun 
budgets by 100%. As the level of project definition and quantity of engineering data increase,
overruns decline to about 10% at a full cost design stage.*

Ensure  early  efforts  focus  on  the  rapid  development  of  high  potential  products,  rather  than  up 
front detailed cost analyses. Get products into existing systems quickly. Keep the up front
analysis simple to reduce development costs. Mitigating costs reduces risk. In the long term, the
incremental  product  development  methodology  promotes  a  diversified  development  portfolio. 

Breaking  down  the  complex  system  development  into  a  step-by-step  sequence  of  limited 
challenges  contains  risk  and  cost.  Northern  Telecom’s  venture  from  analog  into  digital  switching 
systems  for  the  telecommunications  industry  serves  as  an  example.  Instead  of  going  for  the  local 
DMS 100 switch right away, the company started with the development of a PBX (private branch
exchange),  which  gave  it  a  base  for  understanding  important  new  technologies,  digitization 
techniques,  advanced  programming  languages,  and  network  design.  Treating  the  effort  as  a  step- 
by-step  sequence  of  more  limited  challenges  allowed  Northern  Telecom  to  contain  its 
development  risk  and  keep  development  costs  from  going  through  the  roof.^ 

The  second  step  involves  a  discussion  of  each  candidate  approach.  The  discussion  includes 
establishing  fundamental  criteria  for  advanced  development,  technology  insertion  and  future 
consideration.  Discussion  challenges  the  subjective  nature  of  the  data. 

The  third  and  final  step  sorts  the  candidates  into  those  considered  for  immediate  advanced 
development,  future  technology  insertion,  and  further  consideration.  At  this  part  of  the 
development  process  the  high  priority  candidates  require  a  detailed  quantified  analysis. 

Depending  on  cost  goals,  products  can  move  to  and  from  the  advanced  development  model 
(ADM)  and  technology  insertion.  Existing  ADM  products  replaced  by  candidates  from  the 
technology  insertion  area  act  as  contingencies;  they  create  a  backup  in  case  of  product  failures. 

The  identification,  discussion  and  prioritization  (IDP)  of  discrete  candidates  lead  to  system 
definition.  Figure  3  shows  how  development  proceeds  from  the  "fuzzy  front  end"  to  clear-cut 
product  definition.  The  process  mitigates  risk  by  giving  priority  to  approaches  that  yield  the 
biggest cost savings with the shortest payback period. As time progresses, the product becomes
well  defined  and  information  is  available  to  make  decisions  quantitatively  rather  than  qualitatively. 
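
One way to picture the sorting and prioritization is the Python sketch below. The candidate fields, the payback cutoff, and the limit on advanced development slots are assumptions introduced for illustration; the methodology itself does not prescribe numeric rules.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    annual_savings: float     # estimated life cycle cost savings per year (> 0)
    development_cost: float   # estimated cost to develop and insert

    @property
    def payback_years(self):
        return self.development_cost / self.annual_savings


def prioritize(candidates, adm_limit, max_payback_years):
    """Sort candidates into the three bins used in the third step."""
    viable = [c for c in candidates if c.payback_years <= max_payback_years]
    deferred = [c for c in candidates if c.payback_years > max_payback_years]
    # Biggest savings first; a shorter payback period breaks ties.
    viable.sort(key=lambda c: (-c.annual_savings, c.payback_years))
    return {
        "advanced development": viable[:adm_limit],
        "technology insertion": viable[adm_limit:],
        "further consideration": deferred,
    }


# Hypothetical candidates with subjective estimates.
candidates = [
    Candidate("candidate 1", annual_savings=1.0, development_cost=1.0),
    Candidate("candidate 2", annual_savings=4.0, development_cost=8.0),
    Candidate("candidate 3", annual_savings=0.5, development_cost=5.0),
]
bins = prioritize(candidates, adm_limit=1, max_payback_years=5.0)
names = {key: [c.name for c in group] for key, group in bins.items()}
```

Candidates left in the technology insertion bin remain available as replacements, in keeping with the movement between ADM and technology insertion described above.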


APPLICATION OF THE PDM TO AN OPEN ARCHITECTURE TRANSITION

Traditionally,  combat  system  development  requires  a  quantum  leap  in  performance.  Figure  4 
exemplifies the computing throughput required by expanding sensor array configurations.

In addition to greater computational requirements, this trend also creates demands for faster
communication links and denser memory configurations. One approach to meeting these
demands  involves  the  incorporation  of  an  open  architecture.  Open  architectures  leverage  fast 
paced  commercial  technology  development  by  maintaining  compatibility  with  commercial 
standards. 

Case  histories  show  building  off  existing  foundations  of  core  technologies  generates  success 
in  the  commercial  sector.  Companies  that  focus  new  products  on  extensions  to  a  single  key 
technology  are  far  more  successful  than  those  that  pursue  technical  diversity.  "The  best 
opportunities  for  rapid  growth  come  from  building  an  internal  critical  mass  of  engineering  talent 
in a focused technological area, yielding a distinctive core technology that might evolve over time,
to provide a foundation for the company's product development."^^

Accordingly,  the  transition  from  an  existing  sensor  processing  system  to  an  open  architecture 
based  system  offers  a  prime  example  of  the  utility  of  the  PDM  proposed  in  this  paper.  The 
combination  of  the  PDM  and  open  architecture  philosophies  facilitates  future  technology  upgrades 
for  the  sensor  system.  Hardware  and  software  commonality  contain  costs. 

Figure  5  shows  an  open  architecture  for  a  sensor  processing  system.  The  key  features  of  this 
architecture  are  the  sensor  distribution  network,  the  data  distribution  network  and  the  common 
processing cabinets. The common processing cabinets fit into the open architecture scheme by
utilizing  a  commercially  available  bus  architecture  for  the  backplane.  Any  vendor  can  integrate 
equipment  into  the  system  as  long  as  they  adhere  to  the  interface  standards. 


Figure 5. Notional Open Architecture

The lack of mature standards poses an obstacle to the implementation of open architectures.
For  example,  many  of  the  Next  Generation  Computing  Resources  (NGCR)  initiative's  interface 
standards  have  not  been  written.  Therefore,  cost  and  performance  are  indeterminate. 

In  addition,  many  existing  combat  systems  do  not  possess  open  architecture  attributes.  On 
one  hand,  existing  system  baselines  minimize  development  costs.  On  the  other  hand,  open 
architectures  facilitate  life  cycle  development.  The  current  fiscal  environment  within  DoD  does 
not  favor  system  development  programs  with  a  high  cost  profile.  Nonrecurring  engineering 
funds  are  shrinking.  An  incremental  transition  from  an  existing  system  to  open  architecture 
would  spread  out  the  cost  and  mitigate  the  risk.  The  incremental  approach  also  advances  the  long 
term  goals  of  open  architectures. 

The product development methodology proposed in this paper helps attain this goal. Using
an existing system as a baseline, the transition takes a low risk path to incrementally integrate
open  architecture  concepts  into  the  complex  computer  system.  Figure  6  depicts  this  concept  for  a 
generic  sensor  processing  chain. 


The  network  interface  builds  an  open  architecture  upon  the  strong  foundation  of  the  baseline 
sensor  processing  chain.  First,  tap  into  the  processing  chain.  Next,  insert  the  subsets  of  the 
open architecture through the network interface units. For example, evaluate a novel beamformer
by bypassing the existing beamformer. Gradually change the system as technology becomes
available  at  a  suitable  cost.  Eventually,  the  open  architecture  replaces  the  baseline  sensor  chain. 
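
The bypass step can be pictured as swapping one stage of a processing chain while the baseline chain remains available. The Python sketch below uses placeholder arithmetic in place of real beamforming and signal processing, and all names in it are hypothetical.

```python
def run_chain(stages, samples):
    """Pass data through each processing stage in order."""
    for _name, stage in stages:
        samples = stage(samples)
    return samples


def replace_stage(stages, name, new_stage):
    """Return a new chain in which the named stage is bypassed by new_stage."""
    if name not in [n for n, _ in stages]:
        raise KeyError(name)
    return [(n, new_stage if n == name else s) for n, s in stages]


# Placeholder stages; the real functional blocks are not specified here.
baseline = [
    ("beamformer", lambda xs: [2 * x for x in xs]),
    ("signal processing", lambda xs: [x + 1 for x in xs]),
]
upgraded = replace_stage(baseline, "beamformer", lambda xs: [3 * x for x in xs])

baseline_output = run_chain(baseline, [1, 2])
upgraded_output = run_chain(upgraded, [1, 2])
```

Because replace_stage returns a new chain, the baseline is left intact and can serve as the contingency if the new stage fails.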

The  PDM  facilitates  consideration  of  various  implementations  for  each  functional  block. 
Commercial  off-the-shelf  equipment  serves  as  an  excellent  implementation  for  initial  product 
development. COTS equipment maintains the cost for initial test and evaluation. Test the COTS
prototypes for shock, vibration, temperature, etc., and deploy the equipment with acceptable
performance. Militarize the COTS equipment if environmental test results fall short of
expectations.  Risk  mitigation  occurs  because  the  government  expends  additional  capital  only  for 
verified  performance.  In  addition,  the  availability  of  the  baseline  system  serves  as  a  contingency 
to  reduce  the  risk  of  product  failure. 

Table  I  clarifies  the  utility  of  the  PDM  for  a  next  generation  sensor  system.  In  this  scenario,  a 
sensor system already in production serves as the baseline. Generic candidates for technology
insertion include today's technology, near term upgrades, and projected future commercial
technology. 


                            Today's         Technology
Processing Type             Technology      Near Term (<1 Year)   Future (>5 Years)

Conventional Beamformers    4,000 MIPS      4,000 MIPS            40,000 MIPS
Adaptive Beamformers        NA
Signal Processing           0.3 GFLOPS      2 GFLOPS              21 GFLOPS
Data Processing             70 MIPS         70 MIPS               300 MIPS
Data Processing             35 68030s       35 68030s             130 68040s
Data Processing & I/O       130 68030s      130 68030s            500 68040s

* All units represent effective capability on a per cabinet basis.


Table  I.  PDM  Application  to  Open  Architecture  Technology  Insertion 
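
Reading each row of Table I as today's, near term, and future capability, in that order (an assumption about the column layout), the projected growth can be computed directly. The Python sketch below covers only the rows expressed in MIPS or GFLOPS.

```python
# Per cabinet figures read from Table I, assuming the order
# (today, near term, future) within each row.
table_1 = {
    "Conventional Beamformers (MIPS)": (4000, 4000, 40000),
    "Signal Processing (GFLOPS)": (0.3, 2, 21),
    "Data Processing (MIPS)": (70, 70, 300),
}


def growth_factor(row):
    """Ratio of the projected future capability to today's capability."""
    today, _near_term, future = row
    return future / today


factors = {name: growth_factor(row) for name, row in table_1.items()}
```

Under that reading, conventional beamforming grows by a factor of 10, signal processing by a factor of about 70, and data processing by a factor of roughly 4.3.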

Note  the  equipment  undergoing  advanced  development  functions  as  an  excellent  augmentation 
to  the  baseline  system.  The  technology  insertion  category  would  include  near  term  signal 
processing and adaptive beamforming technologies. Future oriented technologies, which include
parallel  processors  (e.g.,  iWarp,  Paragon,  Connection  Machine)  repackaged  in  multichip 
modules,  do  not  demonstrate  a  definitive  payoff.  The  PDM  suggests  reconsideration  of  these 
technologies  when  the  commercial  market  brings  down  the  cost.  In  the  interim,  develop  the 
network  interface  units  to  facilitate  the  insertion  of  the  advanced  technologies  as  the  market 
matures.


SUMMARY

This  paper  proposes  an  economical  product  development  methodology  for  complex 
computer  systems.  The  PDM  exploits  the  commercial  market  analogy  to  establish  a  flexible,  low 
risk, cost effective approach for technological progress. The strategy pursues the state-of-the-art
while  addressing  risk  management.  The  key  to  the  methodology  is  an  incremental  development 
process. 

The  PDM  assesses  discrete  candidates  and  simplifies  macroscopic  decisions.  The  process 
provides  the  opportunity  to  pursue  the  state-of-the-art  while  concentrating  on  choices  that 
emphasize  low  risk,  cost  savings,  and  short  payback  periods. 

Several  guidelines  enhance  the  chances  of  attaining  this  objective: 

1. increase performance/technology in an incremental fashion,

2. use subjective decision techniques to eliminate poor candidates from the start,

3. fine tune detailed quantitative analyses as the product becomes well defined,

4. create a diversified development portfolio directed at a quantum leap in technology,

5. use COTS equipment for rapid prototyping,

6. build products quickly to reduce development costs, and

7. cultivate products with short payback periods.

The  methodology  has  particular  application  to  complex  mission  critical  computer  systems. 

An  open  architecture  transition  illustrates  the  utility  of  EDP  product  development 
methodology.  The  PDM  sets  priorities  for  a  sensor  processing  chain  by  subjectively  identifying 
the  time,  cost  and  performance  characteristics  of  candidate  technologies. 

System  engineers  can  use  the  product  development  methodology  described  in  this  paper  by: 

1. creating a detailed handbook to guide decisions,

2. devising an expert system to expedite the selection for technology insertion,

3. applying software tools which refine the hierarchical decision process, and

4. using the PDM as a basis for developing complex mission critical computer systems.

The PDM integrates the choices input at lower levels into higher level systems engineering
decisions. System engineers could clearly define the effect of the paths from subsystem to
system level design. Each candidate at the lower level contributes to overall cost, performance,
schedule and risk assessments. In this manner, the PDM enables an efficient approach for
systems engineering.
Abstract

In this paper we propose several strategies for improving the safety margin of a real-time system
using the rate monotone algorithm, by utilizing application characteristics. The rate monotonic
scheduling algorithm assumes that all tasks are initiated simultaneously. In this work, we relax this
worst-case assumption and determine the optimal initiation times for a 2-task system to increase
the utilization bound of the system. It turns out that the achievable utilization depends also on the
relationship of task periods. We then investigate this relationship and show how task periods may
be modified to further optimize the utilization bound. This results in an increased safety margin
of the system. We analytically derive expressions for optimal initiation times and the utilization bound
for a 2-task system. Extending a similar analytical study to a system with an arbitrary number of
tasks is extremely complex, so we develop algorithms using the same ideas, and simulation results
show a similar increase in the safety margin of the system.


1  Introduction 

The rate monotonic scheduling algorithm was introduced by Liu and Layland [3] and is known to be
optimal among static, priority driven, preemptive scheduling algorithms for real-time environments,
subject to its underlying assumptions. It is optimal in the sense that no other fixed priority assignment
rule can schedule a task set which cannot be scheduled by the rate monotone priority assignment.
The assumptions include strict periodicity, task independence, and constant running times. Some of these
assumptions restrict applicability to specific system models. Sha and others [1, 2, 5, 6, 7] have enhanced
this algorithm by devising techniques to deal with non-independent task sets, aperiodic tasks, stochastic
execution times and resource sharing. These were attempts at making the rate monotone algorithm
applicable to different system models, by relaxing some of the assumptions. In our work, the system
model is the same, but we use the semantic information about the application to avoid worst case
situations. The two characteristics we exploit here are the task initiation times and the task periods.

Liu and Layland [3] showed that simultaneous initiation of tasks creates worst-case situations.
Subsequent research has continued with this assumption. Similarly, current research also assumes that tasks
have some fixed user-specified periods. The motivation for this research stems from the fact that in
typical practical applications, system designers do have some choices in the selection of the exact periods
and initiation times of periodic tasks.
For example, an application such as the space station may contain several kinds of periodic tasks: data
acquisition tasks, such as obtaining and recording the values from sensors; situation monitoring tasks,
such as the cabin temperature monitoring performed by the life support system; and control applications,
such as navigation. The periodicity requirement on data acquisition and situation monitoring is typically
of the form "perform at least every 5 seconds". If the system designers actually choose a shorter
period because that fits the scheduling, the system performance will only be improved. Also, since the
requirement is only that each task be performed periodically, unrelated tasks may not have any restriction
on their relative phasing. The time at which the cabin pressure is monitored can be quite independent
of the time at which navigation commands are issued. Thus, except in cases where a particular phasing
is specified because of dependencies, designers can phase tasks in any way which suits the scheduling.
Since most other real-time applications also commonly perform data acquisition, control and monitoring
functions, similar flexibility is likely to be available in a wide range of systems.

We  utilize  this  flexibility  to  obtain  two  kinds  of  benefits  in  the  design  of  the  scheduling.  The  first 
is  to  enhance  the  schedulability  of  the  system  by  increasing  the  utilization  bound.  In  doing  this,  it  is 
possible  that  some  task  sets  which  were  previously  unschedulable  may  now  become  schedulable.  The 
second  goal  of  our  work  is  to  increase  the  safety  margin  of  the  system.  If  we  have  determined  bounds 
on  the  computation  times  of  the  tasks  in  the  system,  and  the  analysis  shows  that  the  set  of  tasks  is 
schedulable,  it  is  still  desirable  to  provide  for  exceptional  situations  where  the  computation  times  of  some 
tasks  happen  to  exceed  the  computed  bound.  We  use  the  term  safety  margin  to  denote  the  extent  by 
which  the  system  utilization  may  increase  before  the  scheduling  breaks  down  and  deadlines  are  missed. 
By  adjusting  task  initiation  times  and  selecting  task  periods  appropriately,  we  increase  the  safety  margin 
of  the  system,  at  no  run-time  cost. 

Our  work  is  divided  into  two  parts.  The  first  is  an  analytical  study  of  a  2-task  system.  We  determine 
optimal  initiation  times,  methods  to  modify  the  task  periods  and  study  the  effects  of  these  modifications 
on  the  safety  margin  of  a  2-task  system.  However,  it  turns  out  that  the  analytical  techniques  used  do 
not  extend  conveniently  to  multi-task  systems.  Hence,  in  the  second  part  of  our  research,  we  develop 
algorithms  to  determine  initiation  times  for  the  tasks  and  to  modify  the  periods  in  order  to  improve  the 
safety  margin.  We  evaluate  the  performance  of  our  algorithms  by  a  simulation  study,  which  involves 
determining the breakdown utilization of the task set as described by Lehoczky, Sha and Ding [1].

The remainder of the paper is organized as follows. In section 2 we introduce the previous
enhancements to the rate monotone algorithm to which this paper is a contribution. In section 3 we present
the analytical study of a two task system and illustrate by examples how the results can be used. In
section 4 we develop algorithms to determine initiation times and modify task periods in a system with
an arbitrary number of tasks. We also describe a simulation study to evaluate the performance of these
algorithms and present the results. In section 5 we summarize our results and conclude our work.


2  Background 

Liu and Layland [3] developed analytical results on the behavior of the rate monotone scheduling
algorithm. In order to do so they made a number of assumptions that include strict periodicity, constant
running times, independent task sets and simultaneous initiation of tasks.

The rate monotone algorithm assigns higher priorities to tasks with higher request rates. Such a
priority assignment is optimum in the sense that no other fixed priority assignment rule can schedule
a task set which cannot be scheduled by the rate monotone priority assignment. The request rate of a
task is defined to be the reciprocal of its request period. The utilization factor of an $n$ task system is defined
to be

$$U = \sum_{i=1}^{n} \frac{C_i}{T_i}$$

where $C_i$ is the computation time and $T_i$ is the period of task $\tau_i$, $1 \le i \le n$.


They proved that this algorithm can schedule any set of $n$ periodic tasks with processor utilization
no larger than $n(2^{1/n} - 1)$, or any task set of any size with processor utilization below $\ln 2 \approx 0.693$ [3].
They show that a critical instant occurs when requests of all tasks arrive simultaneously, and that this is the
worst-case situation, i.e., if all deadlines are met during the critical region, the system will always meet
all its deadlines.


The Liu and Layland bound decreases monotonically from 0.83 when $n = 2$ to $\ln 2 \approx 0.693$ as
$n \to \infty$. However, the rate monotone algorithm can often successfully schedule task sets having a total
utilization higher than 0.693. In fact, task sets with utilization as high as 0.9 are often schedulable. This
suggests that the average case behavior is substantially better than the worst case behavior. Lehoczky,
Sha and Ding [1] describe an exact schedulability criterion to determine task set schedulability. They
perform a stochastic analysis and determine that for uniformly distributed tasks, a breakdown utilization
of 0.88 is a reasonable characterization of the breakdown utilization level.
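
The sufficient test quoted above can be sketched in a few lines. This is a minimal illustration, not code from the paper; the function names and the representation of a task as a `(C, T)` pair are assumptions made here.

```python
import math

def utilization(tasks):
    """Utilization factor: sum of C_i / T_i over (C, T) pairs."""
    return sum(c / t for c, t in tasks)

def liu_layland_bound(n):
    """Sufficient utilization bound n(2^(1/n) - 1) for n periodic tasks."""
    return n * (2 ** (1.0 / n) - 1)

def passes_bound(tasks):
    """True if the task set meets the sufficient (not necessary) bound."""
    return utilization(tasks) <= liu_layland_bound(len(tasks))

if __name__ == "__main__":
    for n in (2, 3, 10, 1000):
        print(n, round(liu_layland_bound(n), 4))
    print("limit ln 2 =", round(math.log(2), 4))
```

A task set that fails this check may still be schedulable, which is the gap between worst case and average case behavior that the paragraph above describes.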


Sha, Lehoczky and Rajkumar [5] address the problem of stochastic execution times. In many
applications the worst case execution time is much larger than the average case execution time and a low
processor  utilization  would  result  if  it  is  to  be  ensured  that  the  system  never  becomes  overloaded.  Hence 
in  order  to  achieve  a  reasonable  average  processor  utilization  a  scheduling  algorithm  must  be  able  to 
take  care  of  transient  overloads.  A  period  transformation  method  is  developed  to  enhance  the  stability 
of  the  algorithm. 

A  real-time  system  has  both  periodic  and  aperiodic  jobs.  Lehoczky,  Sha  and  Strosnider  [2]  developed 
the  Deferrable  Server  (DS)  and  the  Priority  Exchange  (PE)  algorithm,  based  on  the  rate  monotone 
scheduling  algorithm.  Both  algorithms  give  a  greatly  improved  average  response  time  for  soft  deadline 
aperiodic  tasks  while  still  guaranteeing  the  deadlines  of  periodic  jobs.  The  Extended  Priority  Exchange 
(EPE)  algorithm,  an  extension  of  PE  was  developed  by  Sprunt,  Lehoczky  and  Sha  [7].  Sprunt,  Sha  and 
Lehoczky  [8]  devised  the  sporadic  server  algorithm,  an  improvement  over  the  above  algorithms. 

Sha,  Rajkumar  and  Lehoczky  [6]  have  developed  a  priority  inheritance  protocol  and  derived  a  set  of 
sufficient  conditions  under  which  a  set  of  periodic  tasks  that  share  resources  using  this  protocol  can  be 
scheduled  using  the  rate  monotone  algorithm. 

All  of  the  above  enhancements  are  attempts  at  making  the  rate  monotone  algorithm  applicable  to 
different  system  models.  Our  work  however  focuses  on  the  avoidance  of  worst  case  situations.  The  key 
concept  in  our  work  is  avoidance  of  critical  instants.  We  first  show  how  this  can  be  done  by  choosing 
initiation  times  of  the  tasks.  Then  we  show  how  task  periods  may  be  selected  to  avoid  simultaneous  task 
arrivals.  We  do  this  for  a  system  of  two  periodic  tasks  by  analyzing  the  structure  of  task  arrivals  in  each 
time  period.  We  then  analyze  how  these  modifications  affect  the  safety  margin  of  the  system.  Extending 
the analytical techniques to a general $n$ task system is extremely complex. We describe algorithms
to  determine  initiation  times  and  task  periods  based  on  these  ideas  and  determine  by  simulation  the 
improvement  to  the  safety  margin  of  the  system. 
3  Analysis  of  a  2-Task  System

3.1  Structure  of  Task  Arrival  Patterns 


In this section we derive the request patterns over different periods for the case of two periodic tasks.
Assume without loss of generality that $T_1 < T_2$. The rate monotonic scheduling algorithm assigns higher
priority to the task $\tau_1$. Let the $i$th period of a task be the interval between the arrival of the $i$th request
and the deadline for that request. Let $I_i$ denote the $i$th period of the task $\tau_2$. Since the rate monotonic
algorithm is preemptive and $\tau_1$ has a higher priority, the available time for the execution of $\tau_2$ in $I_i$ is
the amount of time that remains in $I_i$ after executions of the task $\tau_1$. Let $C_{2,i}$ denote the available time
for the execution of $\tau_2$ in $I_i$.

If for some $i$, the execution time $C_2$ of the task $\tau_2$ is larger than $C_{2,i}$, then the system consisting of
these two tasks is not schedulable [3]. The key idea here is to move the starting time of the second task
$\tau_2$ so as to maximize the smallest $C_{2,i}$ over all $i$.

We derive the structure of the $I_i$'s based upon the periodicity of the tasks and the assumption that
the periods $T_1$ and $T_2$ as well as the run-time $C_1$ are constants. In this section we determine how this
structure changes depending on the relative starting points of the two tasks.

The structure of $I_i$ can be characterized by the first request of the first task $\tau_1$ in $I_i$. Let $D_i$ denote
the time interval between the $i$th request of the task $\tau_2$ and the first request of the task $\tau_1$ in $I_i$. Let $k$
be the least common multiple of $T_1$ and $T_2$ divided by $T_2$. Hence, $D_k = D_0$.

Observation 1 The $D_i$'s are distinct for all $0 \le i \le k - 1$.
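
The offsets can be enumerated directly when both periods are integers and the two tasks start together. The sketch below is an illustration under those assumptions; `phase_offsets` is a name introduced here, and the formula `(-i * T2) mod T1` is one way of expressing "time from the $i$th request of the second task to the next request of the first task".

```python
from math import gcd

def phase_offsets(t1, t2):
    """Return k = lcm(T1, T2) / T2 and the offsets D_0 .. D_{k-1}.

    Assumes integer periods and simultaneous initiation at time 0.
    D_i is the delay from the i-th request of tau_2 (at i * T2) to the
    first request of tau_1 at or after that instant.
    """
    k = (t1 * t2 // gcd(t1, t2)) // t2
    return k, [(-i * t2) % t1 for i in range(k)]

if __name__ == "__main__":
    k, offsets = phase_offsets(18, 26)
    print(k, sorted(offsets))
```

For periods 18 and 26 this gives nine distinct offsets, and in sorted order they are evenly spaced, which is the structure the rest of this section relies on.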

We can characterize $I_i$ by its $D_i$. Let $D_{(i)}$ be a sorted list of the $D_i$'s for $0 \le i \le k - 1$. Let $I_{(i)}$ be a list of
the $I_i$'s with $I_{(i)}$ corresponding to $D_{(i)}$. If the task $\tau_2$ starts $\delta$ after the start of the first task $\tau_1$, then every $D_i$
that is larger than $\delta$ will decrease by $\delta$, and every $D_i$ that is smaller than $\delta$ will increase by
$T_1 - \delta$. Let $D_i'$ be the updated $D_i$.

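$$D_i' = \begin{cases}
D_i - \delta & \text{if } 0 \le \delta \le D_i \\
D_i - \delta + T_1 & \text{if } D_i < \delta < T_1 \\
\delta - D_i & \text{if } -D_i < \delta < 0 \\
T_1 - D_i + \delta & \text{if } -T_1 \le \delta \le -D_i
\end{cases}$$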

Let $D'_{(i)}$ be a sorted list of the $D_i'$'s for $0 \le i \le k - 1$.

Observation  2  If  there  are  i  and  j  such  that  Di  =  Dj  then  D'  =  D'^  for  all  r. 

Lemma 1 For all $0 \le i < k - 1$, $D'_{(i+1)} - D'_{(i)} = T_1/k$.

There are two consequences of this lemma. First, for any starting point $\delta$ of the second task $\tau_2$ there are
$k$ distinct interval structures of $\tau_2$, where $k$ depends on the periods $T_1$ and $T_2$ only. Second, since the
entire system has translational symmetry with period $T_1/k$, it is sufficient to consider starting points of
the second task between 0 and $T_1/k$. Also, shifting $\tau_2$ by $0 < \delta < T_1/k$ is equivalent to shifting $\tau_1$ by $-\delta$ or
by $T_1/k - \delta$.


Figure 1: Relationship between $C_1$ and the end of the first period of $\tau_2$ (regions labeled Case 1 through Case 4)

3.2  Determining  Optimal  Initiation  Times 

In this section we consider $T_1$, $T_2$ and $C_1$ to be given, and we shift the starting time of $\tau_1$ by $0 < \delta < T_1/k$
in order to maximize the running time of $\tau_2$ for which the task set is still schedulable. Let $C_2$ be the
maximum running time of $\tau_2$ for $\delta = 0$ and $C_2 + C_2'$ be the maximum running time over all $\delta$.

From Lemma 1, when the two tasks start together we know that the first request of the task $\tau_1$ in $I_{(m)}$
arrives at $m T_1/k$ from the beginning of $I_{(m)}$, where $0 \le m < k$. Let us consider several cases of the values of
$C_1$ and their relationship to the starting point of the first task $\tau_1$ (see Figure 1).

Let the second task be initiated at time 0 and the first task be initiated at time $\delta < 0$. Note also
that by definition

as long as $T_2$ is not a multiple of $T_1$. (If $T_2$ is a multiple of $T_1$, Liu [3] shows that a utilization of 1 can
be obtained.) The time available for the execution of $\tau_2$ in $I_i$ is simply the time when $\tau_1$ is not executing,
and it changes with $\delta$.

Due  to  space  limitations,  detailed  proofs  of  the  results  shown  below  are  not  included  here.  However, 
proofs  of  these  and  other  results  may  be  found  in  [4]. 

Case 1 : When $C_1$ is such that the last execution of $\tau_1$ in $I_0$ ends at least $T_1/k$ before the end of $I_0$, any
shift of the starting time of $\tau_1$ to the right does not change the available time for the execution of
$\tau_2$ in the critical region, since the time gained at the beginning of the period equals the time lost
at the end of it. Hence, when $0 \le C_1 \le (T_2 - T_1 \lfloor T_2/T_1 \rfloor) - T_1/k$, no increase in utilization is possible.

$$U = 1 + C_1 \left[ \frac{1}{T_1} - \frac{1}{T_2} \left\lceil \frac{T_2}{T_1} \right\rceil \right] \qquad (2)$$

$U$ monotonically decreases with increase in $C_1$.
Case 2 : When $C_1$ is such that the last execution of $\tau_1$ in $I_0$ ends at least $T_1/k$ after the end of the period
of $\tau_2$, then $(T_2 - T_1 \lfloor T_2/T_1 \rfloor) + T_1/k \le C_1 \le T_1$. Let us shift the starting point of $\tau_1$ to the left by
$0 < \delta < T_1/k$. The time available for execution of $\tau_2$ in the critical region $I_0$ does not change, since
$\tau_1$ will execute during the time gained at the end of the period and was executing in the time lost
at the beginning of the period. Hence, as in the first case, no increase in utilization is possible.

$$U = \frac{T_1}{T_2} \left\lfloor \frac{T_2}{T_1} \right\rfloor + C_1 \left[ \frac{1}{T_1} - \frac{1}{T_2} \left\lfloor \frac{T_2}{T_1} \right\rfloor \right] \qquad (3)$$

$U$ monotonically increases with increase in $C_1$.

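Case 3 : When $C_1$ is such that the last execution of $\tau_1$ in $I_0$ ends between $T_1/k$ before the end of $I_0$ and
the end of $I_0$, then a shift of the starting time of $\tau_1$ to the left changes the available time for the
execution of $\tau_2$ in its first period. In this case $(T_2 - T_1 \lfloor T_2/T_1 \rfloor) - T_1/k < C_1 \le T_2 - T_1 \lfloor T_2/T_1 \rfloor$.

$$C_2' = \delta = \frac{C_1 - (T_2 - T_1 \lfloor T_2/T_1 \rfloor) + T_1/k}{2}, \qquad \frac{C_2'}{T_2} = \frac{C_1 - (T_2 - T_1 \lfloor T_2/T_1 \rfloor) + T_1/k}{2 \cdot T_2}$$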

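Case 4 : When $C_1$ is such that the last execution of $\tau_1$ in $I_0$ ends between the end of $I_0$ and $T_1/k$ after
the end of $I_0$, then a shift of the starting time of $\tau_1$ to the right changes the available time for the
execution of $\tau_2$ in its first period. In this case $T_2 - T_1 \lfloor T_2/T_1 \rfloor < C_1 < T_2 - T_1 \lfloor T_2/T_1 \rfloor + T_1/k$.

$$C_2' = \delta = \frac{T_2 - T_1 \lfloor T_2/T_1 \rfloor + T_1/k - C_1}{2}, \qquad \frac{C_2'}{T_2} = \frac{T_2 - T_1 \lfloor T_2/T_1 \rfloor + T_1/k - C_1}{2 \cdot T_2}$$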


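In both Case 3 and Case 4 the new utilization is

$$U = \frac{C_1}{T_1} + \frac{C_2}{T_2} + \frac{C_2'}{T_2}$$

where $C_2'$ is the maximum possible increase in $C_2$ in the critical time zone, due to the optimal starting
times of the tasks.

$$U = \frac{1}{2} \left( 1 + \frac{T_1}{T_2} \left\lfloor \frac{T_2}{T_1} \right\rfloor \right) + \frac{T_1}{2k \cdot T_2} + \frac{C_1}{T_2} \left( \frac{T_2}{T_1} - \left\lfloor \frac{T_2}{T_1} \right\rfloor - \frac{1}{2} \right)$$

Let $f = \frac{T_2}{T_1} - \left\lfloor \frac{T_2}{T_1} \right\rfloor$. Then the utilization $U$ is a function of $f$ and has the following dependency on it.


If $f < \frac{1}{2}$ then $U$ monotonically decreases with increase in $C_1$.
If $f > \frac{1}{2}$ then $U$ monotonically increases with increase in $C_1$.
If $f = \frac{1}{2}$ then $U$ is a constant.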


Figure 5 represents the variation of the processor utilization bound with the execution time of
the first task, $C_1$. It also compares the utilization bound obtained by having both tasks start at the
same time, U(old), with that obtained by having them start at different times, U(new). We notice that
U(new) is an improvement on U(old) in regions of $C_1$ where U(old) has its minimum value.
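
The effect of the initiation time can also be checked numerically, without the case analysis. The sketch below is an illustration written for this purpose, not the authors' procedure. It assumes integer periods, a first task that runs for $C_1$ immediately at each release (it has the higher priority), and a sign convention in which a positive `shift` delays the first task relative to the second. The function names and the grid search are choices made here.

```python
from math import gcd

def min_available(t1, t2, c1, shift=0.0):
    """Smallest time left for tau_2 in any of its periods over one hyperperiod.

    tau_2 is released at i * t2; tau_1 is released at shift + j * t1 and,
    having higher priority, runs for c1 right after each release.
    """
    hyper = t1 * t2 // gcd(t1, t2)
    best = float("inf")
    for i in range(hyper // t2):
        start, end = i * t2, (i + 1) * t2
        busy = 0.0
        j = int((start - shift) // t1) - 1  # earliest job that could overlap
        while shift + j * t1 < end:
            release = shift + j * t1
            busy += max(0.0, min(end, release + c1) - max(start, release))
            j += 1
        best = min(best, (end - start) - busy)
    return best

def best_shift(t1, t2, c1, steps=200):
    """Grid search over one symmetry period T1/k for the most favorable shift."""
    k = t1 // gcd(t1, t2)
    span = t1 / k
    candidates = [span * s / steps for s in range(steps)]
    return max(candidates, key=lambda d: min_available(t1, t2, c1, d))

if __name__ == "__main__":
    t1, t2, c1 = 18, 26, 8
    d = best_shift(t1, t2, c1)
    print(min_available(t1, t2, c1), d, min_available(t1, t2, c1, d))
```

The value returned by `min_available` is the largest $C_2$ for which the second task still meets every deadline, so comparing it at shift 0 and at the best shift shows the gain in schedulable utilization for one particular $C_1$.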
3.3  Modifying  Task  Periods  with  Optimal  Initiation  Times


In this section, we try to improve the processor utilization bound by slightly changing the periods $T_1$
or $T_2$, or both, of the two tasks $\tau_1$ and $\tau_2$, respectively. In section 3.2, $C_1$, $T_1$ and $T_2$ are assumed to
be constants and $C_2$ is adjusted to fully utilize the available processor time in the critical time zone.
Then the difference in starting times of the two tasks that gives the maximum possible increase in $C_2$
is determined and $U$ is given by equation 5. Here we assume that the two tasks start at different
starting times, chosen to give the best improvement of the utilization bound.

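In the range

$$T_2 - T_1 \left\lfloor \frac{T_2}{T_1} \right\rfloor - \frac{T_1}{k} \le C_1 \le T_2 - T_1 \left\lfloor \frac{T_2}{T_1} \right\rfloor + \frac{T_1}{k}$$

improvements of utilization are possible.

1. If $f < \frac{1}{2}$, $U$ monotonically decreases with increase in $C_1$. Minimum occurs at

$$C_1 = T_2 - T_1 \left\lfloor \frac{T_2}{T_1} \right\rfloor + \frac{T_1}{k}$$

The utilization bound is

$$U(\text{new}) = 1 - \frac{f \left( 1 - \left( f + \frac{1}{k} \right) \right)}{I + f} \qquad (6)$$

where

$$f = \frac{T_2}{T_1} - \left\lfloor \frac{T_2}{T_1} \right\rfloor, \qquad I = \left\lfloor \frac{T_2}{T_1} \right\rfloor.$$

2. If $f > \frac{1}{2}$, $U$ monotonically increases with increase in $C_1$. Minimum occurs at

$$C_1 = T_2 - T_1 \left\lfloor \frac{T_2}{T_1} \right\rfloor - \frac{T_1}{k}$$

The utilization bound is

$$U(\text{new}) = 1 - \frac{\left( f - \frac{1}{k} \right) (1 - f)}{I + f} \qquad (7)$$

3. If $f = \frac{1}{2}$, $U$ is a constant. Minimum occurs over the whole range

$$T_2 - T_1 \left\lfloor \frac{T_2}{T_1} \right\rfloor - \frac{T_1}{k} \le C_1 \le T_2 - T_1 \left\lfloor \frac{T_2}{T_1} \right\rfloor + \frac{T_1}{k}$$

U(new) is given by either of the above equations 6, 7.

In the paper by Liu and Layland, the utilization bound is

$$U(\text{old}) = 1 - \frac{f (1 - f)}{I + f} \qquad (8)$$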


Each value of $f$ has only one value of $k$ associated with it. If $f = a/b$, where $a$, $b$ are relatively prime
integers, then $k = b$. Thus given the value of $f$, the value of $k$ can be determined. Each finite value of $k$,
$k \ge 2$, has a finite number of values of $f$ associated with it. The values of $f$ associated with $k$ correspond
to $x/k$, where $x$ is any integer, $1 \le x < k$, and $x$ and $k$ are relatively prime. For example, if $k = 6$, the
corresponding $f$ values are $1/6$ and $5/6$. Thus, the number of values of $f$ associated with a single value of $k$
depends on whether $k$ is prime and, if not, on the number of integers less than $k$ that have a common factor
with $k$. Thus for a particular $k$, all possible $f$ values can be generated, and hence all possible values of
the utilization bound.

Table 1 (Appendix A) shows the various utilization bounds for $2 \le k \le 20$. It is obtained by
generating all possible $f$ values for each $k$ value and substituting in equations 6, 7.
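
The table generation described above can be sketched as follows. This is an illustration, not the authors' program. It assumes $I = \lfloor T_2/T_1 \rfloor = 1$ by default, takes U(old) $= 1 - f(1-f)/(I+f)$, and takes U(new) $= 1 - f(1 - (f + 1/k))/(I+f)$ for $f \le 1/2$. The branch used for $f > 1/2$, $1 - (f - 1/k)(1-f)/(I+f)$, is not legible in this copy of the paper and is derived here from equation 2 evaluated at the stated minimum, so it should be treated as an assumption. The function names are hypothetical.

```python
from fractions import Fraction
from math import gcd

def f_values(k):
    """All fractions x/k in lowest terms with 1 <= x < k."""
    return [Fraction(x, k) for x in range(1, k) if gcd(x, k) == 1]

def u_old(f, big_i=1):
    """Two-task bound with simultaneous initiation."""
    return 1 - f * (1 - f) / (big_i + f)

def u_new(f, k, big_i=1):
    """Two-task bound with the best initiation offset (see assumptions above)."""
    if f <= Fraction(1, 2):
        return 1 - f * (1 - (f + Fraction(1, k))) / (big_i + f)
    return 1 - (f - Fraction(1, k)) * (1 - f) / (big_i + f)

if __name__ == "__main__":
    for k in range(2, 21):
        for f in f_values(k):
            print(k, f, float(u_old(f)), float(u_new(f, k)))
```

Exact fractions are used so that the special case $k = 2$, $f = 1/2$ evaluates to exactly 1 rather than to a rounded value.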

Graph 1, obtained from this table, is a plot of both U(old) and U(new) for various values of $f$ and $k$.

3.4  Safety  Margin 

In many applications, task execution times are stochastic, and the worst case execution time is much
larger than the average execution time. If the system were designed to take care of all overloads, a very
low utilization would result. Hence, in order to have better utilization, the system must have the capacity
to handle occasional transient overloads.

An increase in the execution times of the tasks increases the utilization of the system. If the utilization
bound is high enough, this will not affect the schedulability of the system. Thus, the safety margin,
which we define as the difference between the utilization bound of the system and the actual utilization
of the system, becomes important.

In  this  section  we  compare  the  safety  margins  of  two  systems,  defined  as  follows. 

System 1 : A system of two tasks $\tau_1$ and $\tau_2$ with periods $T_1$ and $T_2$, and execution times $C_1$ and $C_2$,
respectively. Both tasks are initiated at the same time, as considered in Liu and Layland's paper.

System 2 : Same as the first system, except that

•  The starting time of one task is changed relative to the other, so as to give the best possible
improvement to the utilization bound, and

•  $T_2$ is changed by an amount $z$, to get further improvement in the utilization bound, as shown
in the previous section.

In System 1, let $U_A$ be the actual utilization of the system and $U_B$ be the utilization bound of the
system.

In System 2, let $U_A'$ be the actual utilization of the system and $U_B'$ be the utilization bound of the
system. $k'$ is the least common multiple of $T_1$ and $T_2'$ divided by $T_2'$.

Equations for $U_A$, $U_B$ are obtained from [3]. Equations 2, 3, 5 are used to determine $U_B'$.

Now we compare the safety margins of the two systems, over $0 \le C_1 \le T_1$. We define the increase in
Safety Margin to be

$$(U_B' - U_A') - (U_B - U_A) \qquad (9)$$
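
As a small worked illustration of the quantity just defined, the sketch below computes the increase in safety margin from the four utilization figures. The function names are introduced here, and the numbers in the usage lines are made-up inputs chosen only to show the arithmetic; they are not results from the paper.

```python
from fractions import Fraction

def actual_utilization(c1, t1, c2, t2):
    """Actual utilization C1/T1 + C2/T2 of a two-task system."""
    return Fraction(c1, t1) + Fraction(c2, t2)

def safety_margin_increase(u_a, u_b, u_a_new, u_b_new):
    """(U_B' - U_A') - (U_B - U_A): margin of System 2 minus margin of System 1."""
    return (u_b_new - u_a_new) - (u_b - u_a)

if __name__ == "__main__":
    # Hypothetical figures: lengthening T2 from 26 to 27 lowers the actual utilization.
    u_a = actual_utilization(4, 18, 6, 26)
    u_a_new = actual_utilization(4, 18, 6, 27)
    print(u_a, u_a_new, u_a_new < u_a)
```

A positive result means System 2 can absorb a larger rise in utilization before deadlines are missed.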


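We define

$$x = T_2 - T_1 \left\lfloor \frac{T_2}{T_1} \right\rfloor$$

and

$$y = T_2' - T_1 \left\lfloor \frac{T_2'}{T_1} \right\rfloor$$

We assume

$$\left\lfloor \frac{T_2'}{T_1} \right\rfloor = \left\lfloor \frac{T_2}{T_1} \right\rfloor$$

This assumption is valid because, as shown in the previous section, $k$ is decreased by changing $T_2$ in
such a way as to obtain the largest possible common factors between $(T_2 \bmod T_1)$ and $T_1$. All possible
values of $(T_2 \bmod T_1)$ can be obtained by changing $T_2$, but still maintaining $\lfloor T_2'/T_1 \rfloor = \lfloor T_2/T_1 \rfloor$.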

Also,  z 

Since  the  periods  of  the  tasks  are  changed  only  to  improve  utilization  bounds,  it  is  reasonable  to 
assume  that  small  changes  of  periods  are  preferable. 

Figure 2 represents the relationship between the utilization bound and $C_1$ (the execution time
of the first task), for both systems.

Case 1 : $T_2$ is increased by $z$. The actual system utilization decreases. Clearly for some ranges of $C_1$, the
utilization bound increases for system 2 and hence there is an improvement in the safety margin.
In the other ranges of $C_1$, equations for the utilization bound are considered and the conditions under
which there is an improvement in the safety margin of system 2 are derived. Refer to Appendix B.

Case 2 : $T_2$ is decreased by $z$. For some ranges of $C_1$, there is an improvement in the utilization bound
of system 2. But since $T_2$ is decreased, there is an increase in the actual system utilization. Hence
for all ranges of $C_1$, the conditions for the improvement of the safety margin are derived from the
equations for the actual utilization and the utilization bound of the two systems. Refer to Appendix
B.


A  method  to  choose  z 


From the results obtained, we describe below a method to choose $z$.

If the possibility of variation of the computation time of the first task ($C_1$) is known to be in the
low ranges, i.e., below $x$, then an increase in $T_2$ by $z$ would give an increase in safety margin. The only
restriction  on  2  is  that  2  <  ^,  since  this  would  mean  that  2  <  ^,  since  k  >  k  . 

If it is known to be in the higher ranges, i.e., above $x$, and the utilization of the system is in the range


Ti  -  Cl 

T2 


then  an  increase  in  safety  margin  can  be  obtained  by  decreasing  T2  by  z,  where  z  is  chosen  to  satisfy  the 
conditions, 


and  2  < 


In  both  cases  there  is  an  assumption  that  2  <  ^  .  Since  we  change  T2  in  order  to  reduce  k,  ife  <  ib 

and  hence,  we  can  choose  2  <  ^.  Also  after  2  is  determined,  the  condition  should  be 

satisfied. 


3.5  Using  the  Results 

In this section we see how the results obtained in section 3.2 and section 3.3 can be applied. In section 3.2
we determined that if the two tasks start at different times, then depending on the difference in starting
times, there is an increase in utilization factor for some ranges of $C_1$, where the value of utilization is
minimum. In section 3.3 the optimal difference in starting time is used. From the equations for U(old),
i.e., both tasks starting at the same time, and the equations for U(new), i.e., with the optimal time
difference between the starting points of the two tasks, we see that U(new) is definitely improved by a
positive value, $\frac{1 - f}{k (I + f)}$ for $f > \frac{1}{2}$ and $\frac{f}{k (I + f)}$ for $f < \frac{1}{2}$. Thus, just by having the tasks start at different
times, and especially if the difference is optimal, the upper bound of the utilization factor can be improved.

Figure 2: Variation of utilization bound with $C_1$ (panels for $T_2' = T_2 + z$ and $T_2' = T_2 - z$)

Another way to improve the processor utilization bound is by slightly changing $T_1$ or $T_2$ or both.
U(old) is a function of $f$, where $f$ is a positive fraction, $0 < f < 1$. The minimum value of U(old) occurs at
approximately $f = 0.414$. U(old) increases as the value of $f$ moves away from 0.414 towards 0 or 1. Small
changes in $f$ cause small changes in U(old). Thus, it is possible to increase U(old) by changing the value of $f$:
since the minimum value of U(old) lies at $f = 0.414$, the value of $f$ should be moved away from 0.414.

U(new) is a function of both $f$ and $k$. With change in $f$ it behaves in a similar fashion to U(old),
except that some values of $f$ give lower values of $k$, and for such values U(new) improves drastically.
The best value of $k$ is 2, and this gives a utilization factor of 1. Thus, it is possible to increase the upper
bound of the utilization factor by decreasing the value of $k$. In particular, if $k = 2$, U(new) improves to 1.
However, for very high values of $k$, say above 20, the change in U(new) for a small change in $k$ is small.

Small changes in T2, with T1 constant, introduce small changes in f. Small changes in T1, with T2 constant, introduce either small or large variations in f.

Consider T1 being constant and T2 being varied slightly. There may be only a small increase in U(old). However, this could result in a large increase in U(new), since changes in T2 can reduce k to a low value.

k can be reduced in the following way. Write $f = \frac{T_2 \bmod T_1}{T_1}$. If $f$ is written in lowest terms as $f = \frac{a}{b}$, with $a$ and $b$ relatively prime integers, then $k = b$. Thus, if T1 is prime, or T2 mod T1 is a prime that does not divide T1, or T1 and T2 mod T1 are otherwise relatively prime, then $k = T_1$, which is high since $1 < k \le T_1$. Otherwise $f = \frac{T_2 \bmod T_1}{T_1}$ can be rewritten as $f = \frac{a}{b}$, where $a$, $b$, $i$ are integers such that $T_2 \bmod T_1 = a \cdot i$, $T_1 = b \cdot i$ and $i \ne 1$, so $k = b < T_1$ and k is smaller. The closer $i$ is to $T_2 \bmod T_1$, the better.
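The reduction of f to lowest terms can be sketched in a few lines. The helper name `k_value` is hypothetical; it simply reads k off as the denominator of (T2 mod T1)/T1 after reduction, as described above, and relies on Python's `fractions.Fraction` to perform the reduction.

```python
from fractions import Fraction


def k_value(t1, t2):
    """Denominator of f = (t2 mod t1) / t1 in lowest terms (integer periods)."""
    f = Fraction(t2 % t1, t1)   # Fraction reduces to lowest terms automatically
    return f.denominator


# With T1 = 18, nearby values of T2 give very different k values.
for t2 in (26, 27, 28):
    print(t2, k_value(18, t2))
```

For a prime T1 such as 17 the function returns 17 for any T2 that is not a multiple of 17, which illustrates why a prime T1 forces a high k.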

EXAMPLE 

Here we present a numerical example which illustrates the gains possible by modifying initiation times and periods. Consider the case when T1, which is not a prime number, is held constant and T2 can be varied slightly. T2 can be adjusted so that T2 mod T1 has a common factor with T1 that is as close to T2 mod T1 as possible. Let T1 = 18.

1. If T2 = 26 then T2 mod T1 = 8, GCD = 2.

$f = \frac{T_2 \bmod T_1}{T_1} = \frac{8}{18} = \frac{4}{9}$

k = 9

U(old) = .829
U(new) = .867


2. If T2 = 27 then T2 mod T1 = 9, GCD = 9.

$f = \frac{T_2 \bmod T_1}{T_1} = \frac{9}{18} = \frac{1}{2}$

k = 2

U(old) = .833
U(new) = 1.0

3. If T2 = 28 then T2 mod T1 = 10, GCD = 2.

$f = \frac{T_2 \bmod T_1}{T_1} = \frac{10}{18} = \frac{5}{9}$

k = 9

U(old) = .841
U(new) = .873

From  the  above  three  cases  it  is  evident  that  if  the  original  value  of  T2  was  either  26  or  28,  then 
by  just  changing  it  to  27  the  utilization  bound  can  be  improved  from  .867  to  1.0  and  from  .873  to  1.0, 
respectively. 

When T2 is changed from 28 to 27, however, there is an increase in the actual utilization. If, for instance, C1 = 5 and C2 = 8, the actual utilization increases from .564 when T2 = 28 to .574 when T2 = 27. But this increase is small when compared to the improvement in the utilization bound. Also, it is not always necessary to decrease T2; an increase in T2 could also improve the utilization bound while decreasing the actual utilization. In either case, the safety margin, which is the difference between the bound and the actual utilization, may be increased considerably.
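The trade-off between actual utilization and the bound can be made concrete with a small calculation. In the sketch below the U(new) values are copied from the example above rather than recomputed, and the names `actual_utilization`, `BOUND` and `safety_margin` are illustrative only.

```python
def actual_utilization(comps, periods):
    """Sum of C_i / T_i over all tasks."""
    return sum(c / t for c, t in zip(comps, periods))


# U(new) for T1 = 18, taken from the example in the text (not derived here).
BOUND = {28: 0.873, 27: 1.0}


def safety_margin(t2, c1=5, c2=8, t1=18):
    """Bound minus actual utilization for the two-task example."""
    return BOUND[t2] - actual_utilization([c1, c2], [t1, t2])


for t2 in (28, 27):
    print(t2, actual_utilization([5, 8], [18, t2]), safety_margin(t2))
```

The actual utilization rises by roughly .01 when T2 goes from 28 to 27, while the margin grows by more than .1, which is the point made in the paragraph above.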

Keeping T2 constant and changing T1 could similarly improve U(new) by reducing k sufficiently.

However, it should also be noted that, since U(new) is a function of both k and f, in some cases a reduction in the value of k alone is not sufficient to guarantee an increase in U(new). Usually the best improvement is obtained when the original value of f is close to .4. If the original value of f is close to 0 or 1, the utilization bound is already high, and an increase is guaranteed only when changes in T1 or T2 reduce k to 2, because U(new) then improves to 1.

Also,  this  method  of  having  different  starting  points,  to  improve  the  upper  bound  to  utilization  can 
only  be  used  if  the  ratio  of  the  periods  of  the  two  tasks  is  rational.  This  question  however  does  not  arise 
in  practice  because  both  periods  would  be  derived  from  the  system  clock,  and  hence,  their  ratio  will 
always  be  rational. 


4 Many-Task Systems

4.1  Introduction 

When we consider a general n-task system, an analytical approach to determining the optimal initiation times for the tasks, which avoid critical instants in order to improve the utilization bound, is extremely complex. This is because of the dependence of the structure of the lower priority periods on the higher ones. Here we describe an algorithm to determine the initiation times for the task set, which results in a better utilization bound and hence a better safety margin. This algorithm favors tasks with a higher priority, i.e., tasks with a shorter period. In [5] the period transformation method is described, by which the periods of the more important tasks are transformed to values smaller than the periods of less important tasks, or vice versa. This makes the priority of each task match its criticality. This idea could be used to enhance the performance of our algorithm. We also describe a method to reduce the periods slightly and obtain a better utilization bound.

In [1], a stochastic analysis of the performance of the rate monotone algorithm is presented. A task set is generated randomly, the computation times are multiplied by a small factor δ, and δ is systematically increased to a threshold value at which some task deadline is missed. The utilization corresponding to this value of δ is defined as the breakdown utilization. We evaluate the performance of our algorithms by determining the breakdown utilization of the task set whose initiation times are determined or whose periods are changed by our algorithms, and comparing it with the breakdown utilization obtained with the original specifications. It must be noted that at the point of breakdown the system may not be fully utilized. It may still be possible to increase the computation times of some of the tasks and still keep the system schedulable. However, evaluating the system by determining the breakdown utilization is a more realistic approach.

4.2 Determining Initiation Times

4.2.1 An Algorithm to Determine Initiation Times

The system consists of n periodic tasks $\tau_1, \tau_2, \ldots, \tau_n$ with periods $T_1, T_2, \ldots, T_n$ and execution times $C_1, C_2, \ldots, C_n$, respectively. Without loss of generality it is assumed that $T_1 < T_2 < \cdots < T_n$, and also that $T_1, T_2, \ldots, T_n$ are all integers. The task with a smaller period has a greater priority and the tasks are scheduled by the rate monotone scheduling algorithm. The criticality of a task is assumed to be the same as the priority given to the task.

The algorithm described below improves the safety margin of the system by initiating each task at a specific time. The initiation time for each task $\tau_i$ is determined taking into account all the tasks with a higher priority, because under the rate monotone scheduling algorithm the schedulability of a particular task is affected only by tasks of higher priority. The optimal initiation time of a task $\tau_i$ maximizes the available time for the execution of the task, assuming that the tasks $\tau_1, \tau_2, \ldots, \tau_{i-1}$ are initiated at their respective optimal initiation times. The algorithm fits tasks into the schedule one by one, starting with the task with the highest priority and going down to the last task in order of decreasing priority.

1. The greatest common divisor of $T_1, T_2, \ldots, T_n$ and $C_1, C_2, \ldots, C_n$ is determined to be gcd.

2. The task $\tau_1$ is initiated at $t = 0$.

3. Each subsequent task is initiated at times $t \ge 0$ to determine the optimal initiation time for the task. We do not consider initiation times $t < 0$ due to symmetry.

4. The initiation time for task $i$ is determined in step $i$.

5. $s_i$ denotes the initiation time for task $i$ and $S_i$ denotes the optimal initiation time for task $i$.

6. $LCM_i$ denotes the least common multiple of $T_1, T_2, \ldots, T_i$.

7. $C_i^{max}$ denotes the maximum possible value of $C_i$ for which the task $\tau_i$ remains schedulable, assuming $\tau_1, \tau_2, \ldots, \tau_{i-1}$ are initiated at $S_1, S_2, \ldots, S_{i-1}$ respectively and $\tau_i$ at $s_i$.

8. $\delta_{i-1}$ denotes the smallest possible busy CPU block or free CPU block at the beginning of step $i$ when $s_i$ is a multiple of $\delta_{i-1}$.

Consider step $i$, i.e., tasks $\tau_1, \tau_2, \ldots, \tau_{i-1}$ are initiated at $S_1, S_2, \ldots, S_{i-1}$ respectively and $S_i$ is being determined. We do not consider $C_i$ since $C_i^{max}$ is to be determined.

Consider the region $t = s$ to $t = LCM_i + s$, where $s$ is the initiation time of $\tau_i$. This can be divided into $k$ regions $R_1, R_2, \ldots, R_k$, each of size $T_i$.

Free CPU blocks and busy CPU blocks (see Figure 3) are defined within each region $R_x$, where $x = 1, 2, \ldots, k$. A free CPU block is a continuous region within a region $R_x$ where the CPU is idle.

A busy CPU block is a continuous region within a region $R_x$ where the CPU is busy.

In step 1, $\delta_1 = \mathit{gcd}$ and $S_1 = 0$. In each step $i$, with $i$ ranging from 2 to $n$,

1. $\delta_i = \delta_{i-1}/2$

2. $s_i$ is varied from 0 in steps of $\delta_i$ up to the greatest common divisor of $T_i$ and $LCM_{i-1}$. For each $s_i$, $C_i^{max}$ is determined. The value of $s_i$ that maximizes $C_i^{max}$ is the optimal value $S_i$.

Figure 3: Free and Busy CPU Blocks


We do not concern ourselves with the computational complexity of this algorithm right now because the rate monotone schedule is designed off-line.
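The steps above can be sketched as a small slot-based simulation. This is only an illustration under several assumptions that are not in the text: time is rescaled so that the finest step is one integer tick, every higher priority task is treated as released at all times congruent to its initiation time modulo its period (steady state), a warm-up of a few hyperperiods is assumed to be long enough for the schedule to repeat, and $C_i^{max}$ is taken to be the smallest amount of idle time left in any of the k regions. The function names are hypothetical.

```python
from functools import reduce
from math import gcd


def lcm(a, b):
    return a * b // gcd(a, b)


def idle_profile(periods, comps, offsets, start, length):
    """Idle flags for unit slots [start, start + length) in steady state.

    Tasks are listed in decreasing priority order. A warm-up of several
    hyperperiods is simulated first (assumed sufficient for periodicity).
    """
    if not periods:
        return [True] * length
    hyper = reduce(lcm, periods)
    warm = (len(periods) + 1) * hyper
    remaining = [0] * len(periods)
    idle = []
    for t in range(warm + start + length):
        for j, (T, C, O) in enumerate(zip(periods, comps, offsets)):
            if t >= O and (t - O) % T == 0:
                remaining[j] += C              # new job of task j is released
        running = next((j for j, r in enumerate(remaining) if r > 0), None)
        if running is not None:
            remaining[running] -= 1            # highest priority pending job runs
        if t >= warm + start:
            idle.append(running is None)
    return idle


def initiation_times(periods, comps):
    """Return (initiation times, maximum computation times), one entry per task."""
    n = len(periods)
    g = reduce(gcd, list(periods) + list(comps))
    scale = 2 ** (n - 1)                       # one tick = g / 2**(n-1) time units
    P = [T * scale // g for T in periods]
    C = [c * scale // g for c in comps]
    S, c_max = [0], [P[0]]                     # step 1: task 1 starts at t = 0
    delta, hyper = scale, P[0]                 # delta_1 = gcd, expressed in ticks
    for i in range(1, n):
        delta //= 2                            # delta_i = delta_(i-1) / 2
        limit = gcd(P[i], hyper)               # search s up to gcd(T_i, LCM_(i-1))
        hyper = lcm(hyper, P[i])               # LCM_i
        best_s, best_c = 0, -1
        for s in range(0, limit + 1, delta):
            idle = idle_profile(P[:i], C[:i], S, s, hyper)
            c = min(sum(idle[x:x + P[i]]) for x in range(0, hyper, P[i]))
            if c > best_c:
                best_s, best_c = s, c
        S.append(best_s)
        c_max.append(best_c)
    tick = g / scale
    return [s * tick for s in S], [c * tick for c in c_max]


print(initiation_times([100, 250], [45, 120]))
```

For the two-task entry of Table 2 (periods 100 and 250, computation time 45 for the first task) this sketch selects an initiation time of 22.5 for the second task with a maximum computation time of 137.5, in agreement with the table.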


4.2.2  Effectiveness  of  the  algorithm 

Our algorithm does not always determine optimal initiation times. However, we present here some observations about the structure of the periods of the task system, which suggest that the results obtained using our algorithm will be close to optimal.

1. The region between $t = s_i$ ($s_i$ a multiple of $\delta_{i-1}$) and $t = LCM_i + s_i$ can be divided into $k$ regions $R_x$, $x = 1, 2, \ldots, k$, each of size $T_i$. We define these regions to be k-regions.

The difference between the time available for the execution of $\tau_i$ in any two regions $R_x$ is a multiple of $\delta_{i-1}$.

This is because any free or busy CPU block in any region is a multiple of $\delta_{i-1}$.

2. Starting from 0 or any multiple of $\delta_{i-1}$, if we move the initiation time of $\tau_i$ by any distance $x$ less than or equal to $\delta_{i-1}$, then the available time for execution of $\tau_i$ in any region either

1. increases exactly by $x$

2. decreases exactly by $x$

3. remains the same.

The only extreme cases to be considered are shown in Figure 4.

Over all the regions between $s_i$ and $LCM_i + s_i$, the total gain in time for execution of $\tau_i$ equals the total loss.

3. From the above two observations, maxima for the time available for the execution of $\tau_i$ will be obtained only at $s_i$ values that are multiples of $\delta_i = \delta_{i-1}/2$.

In the extreme case, two regions $R_k$ and $R_j$ differ by $\delta_{i-1}$: the time available for the execution of $\tau_i$ in $R_k$ is $x$ and in $R_j$ is $x + \delta_{i-1}$. Also, $x$ is the minimum time available for the execution of $\tau_i$ over all the regions $R_x$. Now if the initiation time is moved by $\delta_i = \delta_{i-1}/2$, the time available for the execution of $\tau_i$ is $x + \delta_{i-1}/2$ in $R_k$ and $x + \delta_{i-1}/2$ in $R_j$, and this is a maximum. Hence, if all values for $s_i$ that are multiples of $\delta_i$ are considered, a maximum value for the time available for the execution of $\tau_i$ will not be missed.

Figure 4: Extreme Cases

4.3  An  Algorithm  to  determine  the  Periods 

From the analysis of a two-task system we determined that the utilization bound of the system improves when the k value (k is the LCM of T1 and T2 divided by T2) is reduced, or in other words when the greatest common divisor of the periods is increased, especially when the value of f is close to 0.414 ($f = T_2/T_1 - \lfloor T_2/T_1 \rfloor$). Our algorithm for a general n-task system is based on this idea.

We assume that it is allowable to reduce the periods of the tasks by a specified percentage (say x). This is a reasonable assumption because, in a common real-time application such as data sampling, a faster rate of sampling is often desirable.

The tasks in the set are $\tau_1, \tau_2, \ldots, \tau_n$, with periods $T_1, \ldots, T_n$ respectively, where $T_1 < T_2 < \cdots < T_n$.

The new periods are determined by an exhaustive search over the range $(1 - x/100)\,T_i$ to $T_i$ for each task $\tau_i$, $1 \le i \le n$.

1. $T_1$ and $T_2$ are considered, and a combination of periods that gives the smallest k value is chosen. Let them be $T_1'$ and $T_2'$ respectively.

2. The least common multiple of $T_1'$ and $T_2'$ is determined. Let it be $LCM_2$.

3. For $3 \le i \le n$, $T_i'$ is chosen so as to obtain the smallest k value, or equivalently the largest greatest common divisor between $T_i'$ and $LCM_{i-1}$, where $LCM_{i-1}$ is the least common multiple of $T_1', \ldots, T_{i-1}'$. Also, a $T_i'$ closer to $T_i$ is favoured.

The new $LCM_i$ is determined.
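The period search above might be sketched as follows. Several details are assumptions made for illustration and are not specified in the text: candidate periods are restricted to integers, the k value at step i is taken to be $LCM_{i-1}$ divided by its greatest common divisor with the candidate, and ties are broken in favour of the candidate closest to the original period. The names `candidates` and `choose_periods` are hypothetical.

```python
from math import gcd


def lcm(a, b):
    return a * b // gcd(a, b)


def candidates(period, x):
    """Integer periods from `period` down to (1 - x/100) * period, nearest first."""
    lowest = -(-period * (100 - x) // 100)    # ceiling using integer arithmetic
    return range(period, lowest - 1, -1)


def choose_periods(periods, x):
    """Pick reduced periods that keep the k values small (x is a whole percentage)."""
    t1, t2 = periods[0], periods[1]
    pairs = [(a, b) for a in candidates(t1, x) for b in candidates(t2, x)]
    # k for a pair is a / gcd(a, b); prefer small k, then small total change.
    a, b = min(pairs, key=lambda p: (p[0] // gcd(p[0], p[1]),
                                     (t1 - p[0]) + (t2 - p[1])))
    new, running = [a, b], lcm(a, b)
    for t in periods[2:]:
        c = min(candidates(t, x),
                key=lambda c: (running // gcd(running, c), t - c))
        new.append(c)
        running = lcm(running, c)             # the new LCM_i
    return new


print(choose_periods([18, 28, 60], 10))
```

With periods 18 and 28 and a 10 percent allowance the search settles on 18 and 27, the same adjustment used in the numerical example of Section 3.5.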

4.4  Simulation  and  Results 

Random task sets are generated, with each task set having (i) two tasks, (ii) three tasks, or (iii) five tasks. For each of these, the average breakdown utilization is determined over 50 task sets.

The task sets are generated by selecting relative periods and computation times from a uniform distribution, and then scaling the computation times to give a schedulable task set. The breakdown utilization is obtained by scaling the computation times systematically until the task set becomes unschedulable.
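A breakdown utilization measurement for tasks that are all initiated together can be sketched as below. The text does not say how schedulability is tested or how the scaling proceeds, so two techniques are swapped in here as assumptions: the standard response-time test for fixed-priority tasks released together with deadlines equal to periods, and a bisection on the scaling factor instead of a stepwise increase. Function names are illustrative.

```python
from math import ceil


def rm_schedulable(periods, comps):
    """Response-time test; tasks sorted by increasing period, released together."""
    for i in range(len(periods)):
        response = comps[i]
        while True:
            demand = comps[i] + sum(
                ceil(response / periods[j]) * comps[j] for j in range(i)
            )
            if demand > periods[i]:
                return False              # task i would miss its deadline
            if demand == response:
                break                     # fixed point reached, deadline met
            response = demand
    return True


def breakdown_utilization(periods, comps, tol=1e-9):
    """Utilization at the largest scaling of comps that stays schedulable."""
    base = sum(c / t for c, t in zip(comps, periods))
    lo, hi = 0.0, 1.0 / base              # at scale hi the utilization equals 1
    if rm_schedulable(periods, [c * hi for c in comps]):
        return 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if rm_schedulable(periods, [c * mid for c in comps]):
            lo = mid
        else:
            hi = mid
    return lo * base


print(breakdown_utilization([100, 250], [45, 115]))
```

The test agrees with the two-task entry of Table 2: with periods 100 and 250 and a first computation time of 45, a second computation time of 115 is accepted and 120 is rejected.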


4.4.1  Improvements  due  to  change  in  Task  Initiation  times 

The initiation times for the tasks are determined and a comparison of the breakdown utilization is made with the task system that has all tasks initiated together. The results are tabulated in Table 1.


Table  1 


The  results  indicate  the  following  trends 

• There is a definite improvement in the average breakdown utilization of the task sets when the initiation times of the tasks are changed. This is because for each task the critical instant is avoided, each of the k-regions is considered, and the initiation time is chosen so as to maximize the minimum time available for the task. Hence each task has a little more computation time available and so the breakdown utilization increases.

• As the number of tasks in the task set increases, there is an increase in the improvement of the utilization bound. This is an interesting observation, and it suggests that as the size of the task set increases a lot more can be gained by shifting the initiation times. This is because the time available for the execution of each task is maximized; each task gains a little, and when the number of tasks increases the effect is cumulative.

• We also observe that the extent of the improvement is very small (an average of .5 percent). However, these are average values, and if individual task sets are considered, the actual improvement in breakdown utilization could go up to 6 to 8 percent.

• The above results also indicate that unschedulable task sets can be made schedulable; Table 2 has some examples.


task      period  computation  tasks initiated together  tasks initiated at specific times
priority          time         max comp time             init time    max comp time

Two task system
1         100     45           100                       0.0          100.0
2         250     120          115 (unschedulable)       22.5         137.5

Three task system
1         120     30           120                       0.0          120.0
2         150     65           90                        15.0         105.0
3         200     30           25 (unschedulable)        112.5        57.5

Five task system
1         100     20           100                       0.0          100.0
2         150     45           110                       10.0         120.0
3         350     50           150                       162.5        150.0
4         400     110          100 (unschedulable)       45.0         122.5
5         720     20                                     115.0        27.5

Table 2


The  only  change  that  would  have  to  be  made  is  in  the  initiation  time  of  the  tasks.  Therefore  there 
is  no  change  in  the  actual  utilization  of  the  task  set.  However  we  see  an  improvement  in  the  breakdown 
utilization  and  hence  an  improvement  in  the  safety  margin  of  the  system,  at  practically  no  cost. 


4.4.2  Improvements  due  to  change  in  Task  Periods  and  Initiation  Times 

As the range of the periods decreases, the improvement in average breakdown utilization obtained by shifting the initiation time increases (see Section 4.4.1).

Also, if the periods are harmonic, which is the case in a lot of real-time applications, then the improvement in the breakdown utilization is better (see Section 4.4.1).

The periods of the tasks are changed slightly (a reduction of up to 10 percent is allowed) and then the initiation times are determined. The breakdown utilization is determined and the results are compared with (1) the original system and (2) the system with changed periods only, with all tasks initiated together. The results are tabulated in Table 3.


number of tasks  actual util  breakdown util  safety margin  improvement in
in task set                                                  safety margin

All tasks initiated together
2                0.699809     0.940779        0.240970       -
3                0.699616     0.912586        0.212926       -
5                0.699276     0.871999        0.172723       -

Periods changed and all tasks initiated together
2                0.727403     0.953687        0.229657       -0.011313
3                0.732088     0.930850        0.198762
5                0.737715     0.888321        0.150606       -0.022117

Periods changed and tasks initiated at specific times
2                0.727403     0.978309        0.259060       0.180900
3                0.732088                     0.221478       0.008552
5                0.737715     0.931120        0.193406       0.020737

Table 3

The  results  indicate  that 

• There is a substantial improvement in the breakdown utilization of the task set. Since task periods are reduced, there is also an increase in the actual utilization of the system; however, this is less than the improvement obtained in the breakdown utilization, and hence there is an overall improvement of about 1 percent in the safety margin. For an n-task system a utilization bound of 1 can be obtained by making the period of the nth task a multiple of all other task periods [3]. This means a k value of 1. For each task we maximize the bound by decreasing the k value and obtain additional gain by changing initiation times.

•  We  also  observe  that  if  the  periods  alone  are  changed,  and  the  tasks  all  initiated  together  there  is 
a  small  improvement  in  the  breakdown  utilization  but  because  of  the  increase  in  actual  utilization 
there  is  a  decrease  in  the  safety  margin. 

•  It  should  be  noted  here  also  that  these  are  average  values  and  if  specific  task  sets  are  considered 
the  improvements  in  safety  margin  are  more  substantial. 

Thus an improvement in the safety margin of the overall system is obtained at no run-time cost. In fact, there may also be an improvement in application performance if the periods of sampling, monitoring, etc., are reduced.


5  Conclusion 

We have presented some enhancements to the rate monotone algorithm which exploit application characteristics to avoid worst case situations and improve the safety margin of critical real-time systems without run-time penalty.

We have determined the optimal initiation times for a 2-task system using the rate monotonic scheduling algorithm, and the resulting increase in the utilization bound. It is only possible to improve the bound in some cases; however, sometimes significant improvements may be obtained. In addition, the extent of improvement in the bound depends on the relationship of the periods of the two tasks. Contrary to intuition, reducing task periods may sometimes substantially increase the bound, and transform a previously infeasible task set into a feasible one. By changing the periods and using the optimal initiation times an increase in the safety margin is also obtained.

The techniques used for the two-task case, which involve analyzing the structure of task arrival patterns and determining the relationship between periods, increase quickly in complexity as the number of tasks increases. So algorithms were designed for determining initiation times and periods, and improvements in safety margin were obtained. Also, as the size of the task set increases, the gains due to modified initiation times increase. Reducing periods also sometimes produces a gain in the safety margin.

Results are sufficiently promising that a system designer may wish to put in the effort to select the best initiation times and task periods, to enhance the schedulability and safety of the system.

Abstract

This  paper  presents  a  novel  strategy  for  scheduling  tasks  in  a  distributed  real-time  system.  The  system 
is  composed  of  several  processors  that  communicate  via  dedicated  links.  Tasks  in  the  system  are  either 
periodic  or  aperiodic.  The  schedule  for  each  processor  of  the  system  must  be  constructed  dynamically 
as  aperiodic  tasks  arrive  at  unpredictable  times.  Each  processor  of  the  distributed  system  is  capable 
of  executing  any  of  the  aperiodic  tasks,  while  the  periodic  tasks  are  executed  locally.  The  scheduling 
strategy  is  divided  into  two  components,  a  local  scheduling  strategy  responsible  for  timely  execution  of 
tasks  arriving  at  a  processor  and  a  global  strategy  responsible  for  the  selection  and  stochastic  transfer 
of  those  tasks  that  cannot  be  executed  locally.  The  global  scheduler  uses  a  stochastic  and  learning 
algorithm  to  reduce  the  number  of  late  tasks.  The  simulations  identify  the  circumstances  in  which 
stochastic  scheduling  is  superior  to  deterministic  scheduling. 


I  Introduction 

Real-time systems are defined as those systems in which the correctness of the system depends not only on the logical result of computation, but also on the time at which the results are produced. These systems can be characterized by the presence of tasks that have timing requirements, such as deadlines. Examples of real-time systems include process control systems, aircraft flight control systems, nuclear power plant safety systems and air traffic control systems. Such systems must perform certain actions in a timely manner and their failure to do so may result in severe consequences. A number of such systems, such as nuclear power plant control systems and local area networks controlling the operation of an aircraft carrier, are composed of processes that are inherently distributed, suggesting the possibility of using a distributed system for their implementation. The number of practical systems implemented as distributed systems has, however, been limited by the difficulty of scheduling real-time tasks in a distributed system.

The scheduling of real-time tasks for a distributed system has received considerable attention in recent years (see [1] for a survey). Most of the proposed strategies are, however, aimed at environments where tasks are fully characterized in advance and are suitable only for applications that operate in a static environment. In contrast, a dynamic scheduling strategy allows the system to handle tasks that are unexpected and occur at unpredictable times [6, 8, 9]. This additional flexibility increases the system's ability to adapt to external events and permits exception handling, but also reduces the system's predictability. A possible compromise is to guarantee timely execution of periodic tasks known at system initialization time and then apply a dynamic scheduling strategy to tasks that arise at unpredictable times. This compromise incorporates the flexibility of a dynamic scheduling strategy and the predictability of a static scheduling strategy. The stochastic-learning scheduling strategy uses such a compromise in a loosely coupled network.

*Partially supported by National Science Foundation grant number NCR-9016361

The stochastic-learning scheduler uses a two-level scheduling strategy for scheduling of real-time tasks
in  a  distributed  system.  A  real-time  task  that  enters  the  system  through  a  given  processor  is  first  processed 
by  the  local  scheduler  at  that  processor.  If  the  local  scheduler  cannot  meet  the  timing  requirements  of  a 
task  the  global  scheduling  algorithm  selects  another  processor  of  the  distributed  system  and  transfers  the 
task  to  that  processor  for  remote  execution.  The  local  scheduler  of  the  remote  processor  attempts  to  fit 
this  remote  task  into  its  existing  schedule  so  that  its  timing  requirements  are  met.  The  global  scheduler 
sends  and  receives  information  about  its  past  decisions  and  uses  this  information  to  make  better  decisions 
in  the  future. 

Real-time systems are often used to control critical applications but may experience brief periods
during  which  some  processors  of  the  system  are  overloaded.  The  real-time  control  system  must  minimize 
the  number  of  tasks  that  miss  their  timing  requirements  during  overload  situations.  While  a  simple  local 
scheduling  strategy  is  sufficient  for  scheduling  of  tasks  in  non-overload  situations,  a  global  scheduling 
strategy  is  required  to  handle  situations  where  the  demand  on  some  processors  of  the  distributed  system  is 
higher  than  their  capacity.  Under  such  circumstances,  the  real-time  system  should  continue  to  execute  tasks 
that  satisfy  their  timing  requirements  and  should  minimize  the  number  of  tasks  that  miss  their  deadlines. 
This  paper  uses  a  stochastic-learning  scheduling  algorithm  to  reduce  the  number  of  tasks  that  miss  their 
deadlines.  While  many  metrics  are  important  in  evaluating  a  real-time  system,  the  most  important  measure 
for  such  an  evaluation  is  the  proportion  of  tasks  that  miss  their  timing  requirements.  This  paper  evaluates 
the  stochastic-learning  scheduling  algorithm  with  respect  to  this  metric  and  compares  its  performance  with 
several  other  algorithms. 

The rest of this paper is organized as follows. Section 2 describes the model of the distributed system and the tasks in the system. This section also outlines the overall structure of the scheduler. Section 3 is a description of the local and global scheduling strategies and of the optimal guarantee procedure. In this section data structures used by the local and global schedulers are also described. Section 4 summarizes the simulation results, identifying circumstances under which the stochastic-learning global scheduling strategy is superior to other global scheduling strategies. Some concluding remarks and directions for future research in this area end the paper in Section 5.


II  System  Description 

The system under consideration is a distributed real-time system composed of several homogeneous processors that communicate by exchanging messages on unidirectional dedicated links. Both processors and links are assumed to be fault free and any processor of the distributed system can execute any of the aperiodic real-time tasks. Furthermore, the local clocks of all processors in the system are assumed to be synchronized and only real-time tasks are assumed to be present in the system. A real-time task is defined as a task that must complete its execution prior to its assigned deadline.


II-A  Task  Model 

There are two types of real-time tasks in the system, periodic and aperiodic. A periodic task consists of a computation that is executed repeatedly, once in each fixed period of time. An example of a periodic task is reading a sensor or generating a control output. An aperiodic task consists of a computation that responds to internal or external events. This type of task occurs in the system just once and at an unpredictable time. Typical uses of such tasks include responding to operator requests and exception handling. Since the periodic tasks are known in advance, they can be guaranteed timely execution on their local processor. The aperiodic tasks, on the other hand, are not known a priori and therefore may or may not be guaranteed timely execution on the processor at which they originate. It may be necessary to execute some of these tasks at a different processor to meet their timing requirements.

A periodic task, $T_i$, is characterized by a pair $\{C_i, P_i\}$, representing its computation time and its period, respectively. A periodic task must be executed exactly once during each period. It is not important when the task is executed during its period. In a real-time system there are a number of such tasks, each having its own period. The deadline of the jth instance of a periodic task $T_i$ is $j \times P_i$. This value also represents the time at which the (j+1)st instance of the periodic task $T_i$ is ready to be executed. An aperiodic task, $T_k$, arrives in the system at an unpredictable time and is characterized upon its arrival by its deadline, $D_k$, and its computation time, $C_k$. Such a task can be scheduled for execution any time after its arrival. Both periodic and aperiodic tasks are assumed to be preemptable. It is also assumed that the set of periodic tasks, their characteristics and their assignment to the various processors of the system are known at system initialization time. The characteristics of an aperiodic task become known only when it arrives at a processor. Furthermore, each processor has sufficient processing power to guarantee timely execution of the periodic tasks assigned to that processor. This assumption is enforced by the local scheduler of each processor, which, during the initialization phase, checks that the set of periodic tasks is schedulable. The following lemma establishes the necessary and sufficient conditions for schedulability [7] of a set of preemptable periodic tasks. It states that a set of preemptable periodic tasks is schedulable as long as the processor is not overloaded.


Lemma 1 Let $T_p = \{T_1, T_2, \ldots, T_n\}$ be a set of preemptable periodic tasks. $T_p$ is schedulable if and only if:

$$\sum_{i=1}^{n} \frac{C_i}{P_i} \le 1.$$
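
As an illustration, the test of Lemma 1 can be sketched in a few lines of Python. The sketch assumes that the lemma's condition is the usual processor-utilization bound, i.e. that the sum of $C_i / P_i$ over all periodic tasks does not exceed 1. The function names and the use of integer time units are choices made for this example and are not taken from the study.

```python
from fractions import Fraction


def periodic_set_schedulable(tasks):
    """Return True when the total utilization sum(C_i / P_i) is at most 1.

    tasks: iterable of (computation_time, period) pairs of integers.
    Exact fractions are used so that a utilization of exactly 1 is accepted.
    """
    utilization = sum(Fraction(c, p) for c, p in tasks)
    return utilization <= 1


def instance_deadline(period, j):
    """Deadline of the j-th instance (j = 1, 2, ...) of a periodic task: j * P."""
    return j * period
```

A local scheduler could run such a check once, during initialization, before accepting the periodic task set assigned to its processor.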


No assumptions are made regarding the arrival rate, the minimum inter-arrival time, the computation time
or the deadline of an aperiodic task. However, the simulation model implies that tasks are independent
and there are no precedence constraints among them. Moreover, it is assumed that the only resource shared
between the tasks is processing time.

The characteristics of tasks that are guaranteed for execution on a processor are kept in the Node
Task Table (NTT), which contains the deadline, remaining computation time, earliest start time and the
identification number of each guaranteed task. The characteristics of the periodic tasks are kept in the
processor's Periodic Task Table (PTT). The PTT is used to extend the NTT whenever necessary.

II-B  Overall  Structure  of  the  Scheduler 

The scheduling strategy considered for this study consists of two components, a local scheduler and a
global scheduler. A central idea used in both components is the notion of a guaranteed task. A task is
said to be guaranteed by a processor of the distributed system if the task runs to completion prior to its
deadline on that processor and does not cause any of the previously guaranteed tasks to miss their timing
requirements. The local scheduler attempts to guarantee the timely execution of a task by executing it
locally. The global scheduler attempts to guarantee the timely execution of a task by executing that task
on a remote processor. Periodic tasks are guaranteed to meet their timing requirements and are executed
locally. Aperiodic tasks that arrive into the system wait until they are processed by the local scheduler.
The local scheduler attempts to guarantee aperiodic tasks by examining its current load. When a task
cannot be guaranteed, that task is reconsidered by the global scheduler which probabilistically selects
another processor for the task and forwards the task to that processor.

Once the global scheduler has sent a task to a remote processor, the local scheduler at that processor
must determine if the task can be executed in a timely fashion. Tasks that arrive as a result of a global
scheduling decision are treated as if they were newly arrived tasks except that they are labeled as remote
tasks. If a remote task cannot be guaranteed to meet its timing requirements then that task is discarded
and removed from the system. The decision to discard such a task is based on the observation that the
processor deemed best capable of guaranteeing its timely execution was unable to do so. Furthermore, some
time has been spent transferring the task to the remote processor. Therefore, the probability that some
other processor is able to guarantee the timely execution of this task is quite small. Sending unguaranteed
tasks to a second remote processor generates more traffic in the system and increases the length of the
communication queues. Moreover, sending tasks to a second remote processor decreases the available
processing time by increasing the processing required for scheduling, message handling and system tasks
without corresponding benefits. Simulation results have shown that sending tasks to a second remote
processor reduces the system's performance and, therefore, this idea was not investigated further.
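
The division of labor between the local and global schedulers described above can be summarized in a short Python sketch. The names `handle_task`, `guarantee` and `forward` are hypothetical and are not taken from the study's C simulation; the two callables merely stand in for the guarantee procedure and for the global scheduler's transfer of a task.

```python
def handle_task(task, is_remote, guarantee, forward):
    """Decide the fate of one arriving task.

    guarantee(task) -> bool stands in for the local guarantee procedure.
    forward(task) stands in for the global scheduler sending the task away.
    Both callables are assumptions made for this illustration.
    """
    if guarantee(task):
        return "guaranteed"   # the task would be entered into the NTT
    if is_remote:
        return "discarded"    # remote tasks are never forwarded a second time
    forward(task)             # the global scheduler selects another processor
    return "forwarded"
```

The second branch encodes the design decision argued for in this section: a task rejected by its remote processor leaves the system instead of being sent on again.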


III The Scheduling Strategy

The scheduling of real-time tasks in a distributed system can be viewed as a two-level scheduling activity.
At the lower level the local scheduler at each processor attempts to guarantee the timely execution of
a real-time task by calling the guarantee procedure. The guarantee procedure considers the processor's
current workload and the future instances of all periodic tasks to determine whether the processor has
sufficient processing power to guarantee the timely execution of the task under consideration. At the
higher level the global scheduler is responsible for finding the most likely processor that can guarantee the
timely execution of a locally unguaranteed task.

III-A  Local  Scheduler 

The  local  scheduler  itself  is  implemented  as  a  periodic  task.  To  reduce  the  number  of  tasks  that  miss  their 
deadlines,  the  local  scheduler  is  also  invoked  upon  arrival  of  a  new  task  if  doing  so  will  not  cause  any  of 
the  previously  guaranteed  tasks  to  miss  their  deadlines.  When  the  local  scheduler  is  invoked  it  considers  in 
order  each  of  the  tasks  that  have  arrived  at  the  processor  since  its  last  invocation.  The  local  scheduler  calls 
the  guarantee  procedure  once  for  each  task  to  determine  whether  that  task  can  be  guaranteed  locally.  If  a 
task  is  not  guaranteed  locally  then  that  task  is  handled  by  the  global  scheduler.  Once  a  task  is  guaranteed 
for execution on a processor its computation time, its deadline, its earliest start time and its identification
number are entered into the appropriate row of the NTT.

The local scheduler uses the earliest-deadline-first algorithm to schedule tasks [3]. A task with an
earlier deadline will be scheduled to run before a task with a later deadline if both tasks are ready to be
executed. This algorithm has been shown to be optimal [2] for a single processor and will find a feasible
schedule if one exists. A schedule is considered feasible if all real-time tasks are able to meet their timing
requirements.

The  NTT  is  implemented  as  two  ordered  lists.  The  first  list  is  a  ready  list  which  contains  all  the  tasks 
that  are  ready  to  be  executed.  Tasks  on  the  ready  list  have  start  times  that  are  smaller  than  or  equal  to 
the  current  time  and  are  ordered  in  the  order  of  increasing  deadlines.  The  second  list  is  a  waiting  list  of 
periodic tasks that are not yet ready to be executed, but are guaranteed timely execution. Future instances
of periodic tasks must be considered by the guarantee procedure when it guarantees an aperiodic task.
Tasks on this second list are ordered in the order of increasing start times. A task that moves from this
list  into  the  ready  list  is  inserted  into  the  ready  list  according  to  its  deadline.  Upon  completion  of  the 
currently  executing  task  that  task  is  removed  from  the  NTT  and  the  dispatcher  in  the  processor  is  invoked 
to  select  the  top  item  on  the  ready  list  for  execution. 
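
The two-list organization of the NTT can be illustrated with the following Python sketch. It is only an approximation of the description above: binary heaps from the standard `heapq` module are substituted for the ordered lists, and the class and method names (`NodeTaskTable`, `add`, `release`, `dispatch`) are invented for this example.

```python
import heapq


class NodeTaskTable:
    """Sketch of an NTT with a ready list and a waiting list."""

    def __init__(self):
        self.ready = []    # heap keyed by deadline (earliest deadline first)
        self.waiting = []  # heap keyed by earliest start time

    def add(self, task_id, start, deadline, remaining, now):
        """Insert a guaranteed task into the list matching its start time."""
        if start <= now:
            heapq.heappush(self.ready, (deadline, task_id, remaining))
        else:
            heapq.heappush(self.waiting, (start, deadline, task_id, remaining))

    def release(self, now):
        """Move every waiting task whose start time has been reached."""
        while self.waiting and self.waiting[0][0] <= now:
            _start, deadline, task_id, remaining = heapq.heappop(self.waiting)
            heapq.heappush(self.ready, (deadline, task_id, remaining))

    def dispatch(self):
        """Identifier of the ready task with the earliest deadline, or None."""
        return self.ready[0][1] if self.ready else None
```

For example, with ready tasks `a` (deadline 20) and `b` (deadline 10), `dispatch()` selects `b`; once a periodic task `p` with start time 4 and deadline 8 is released at time 4, `dispatch()` selects `p`, which corresponds to a planned preemption.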

All non-system tasks are assumed to be preemptable. System tasks, such as the local scheduler or
the global scheduler, are considered to be more critical and hence are not preemptable. When a task is
preempted that task will be placed back on the ready list according to its deadline and its computation time
is updated to reflect the processing time it has received so far. Preemptions can be planned or unplanned.
A planned preemption is the result of a periodic task with an earlier deadline than the currently executing
task entering the ready list. Such preemptions cannot cause any of the guaranteed tasks to miss their
deadlines since they are accounted for by the local scheduler. Unplanned preemptions may occur as a
result of a task or a message entering a processor. Such preemptions are allowed if and only if they will not
cause any of the previously guaranteed tasks to violate their timing requirements. Unplanned preemptions
can only invoke a system task, whose characteristics are, of course, known in advance.

III-B  The  Guarantee  Procedure 

The  guarantee  procedure  developed  for  this  study  is  an  optimal  procedure  in  the  sense  that  it  will  guarantee 
a  task  if  and  only  if  that  task  will  not  cause  any  of  the  previously  guaranteed  tasks  to  miss  their  deadlines 
and  if  there  is  sufficient  time  left  to  satisfy  the  timing  requirement  of  this  task.  This  procedure  is  called 
by  the  local  scheduler  once  for  each  aperiodic  task  that  arrives  at  a  processor  of  the  distributed  system. 
The  guarantee  procedure  decides  whether  each  task  can  be  guaranteed  to  receive  enough  processing  time 
to complete its execution prior to its assigned deadline.

The guarantee procedure operates by placing a task temporarily in the NTT and checking that all
previously guaranteed tasks still meet their deadlines. Furthermore, since the periodic tasks are guaranteed
to complete prior to their deadlines, it may be necessary to consider periodic tasks that are not yet ready
to execute. The guarantee procedure achieves this by extending its current window of scheduling. The
time interval from the earliest start time to the latest deadline of all the tasks in the NTT is defined as
the current window of scheduling. The current window of scheduling can be extended by either the local
scheduler or by arrival of an aperiodic task with a deadline later than the latest deadline of all tasks in
the NTT. In both cases the current window of scheduling is extended by an integral multiple of the least
common multiple of the periods of all the periodic tasks in the PTT.
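
A minimal Python sketch of this extension rule follows. It assumes integer time units and picks the smallest integral multiple of the least common multiple that makes the window cover a given deadline; the text only requires an integral multiple, so the choice of the smallest one, like the names `hyperperiod` and `extend_window`, is an assumption of this example.

```python
from functools import reduce
from math import gcd


def hyperperiod(periods):
    """Least common multiple of the integer periods of the periodic tasks."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)


def extend_window(window_end, deadline, periods):
    """Return the new end of the window of scheduling.

    The window grows by whole multiples of the hyperperiod until it
    covers `deadline`; it is left unchanged if it already does.
    """
    shortfall = deadline - window_end
    if shortfall <= 0:
        return window_end
    step = hyperperiod(periods)
    multiples = -(-shortfall // step)  # ceiling division on integers
    return window_end + multiples * step
```

With periods 4, 6 and 10 the hyperperiod is 60, so a window ending at time 60 would be extended to 180 to admit an aperiodic task with deadline 130.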

The  local  scheduler  may  decide  to  extend  the  window  of  scheduling  because  the  time  is  close  to  the 
start  time  of  the  next  instance  of  a  periodic  task  and  that  task  is  not  in  the  NTT.  The  local  scheduler 
will  then  extend  the  current  window  of  scheduling  by  appending  periodic  tasks  to  the  NTT.  If  the  current 
window of scheduling is extended by the arrival of an aperiodic task, then the deadline of the aperiodic task
must fall within the newly extended window of scheduling, ensuring that all periodic tasks possibly affected
by  the  arrival  of  this  aperiodic  task  are  accounted  for. 

After the NTT is extended to account for all periodic tasks that may be affected by the aperiodic task
under  consideration,  we  must  consider  whether  the  aperiodic  task  can  also  be  guaranteed.  The  necessary 
and  sufficient  condition  for  schedulability  for  a  set  of  preemptable  tasks  is  established  in  the  following 
lemma. 

Lemma 2 Let $T_p = \{T_1, T_2, \ldots, T_n\}$ be a set of preemptable tasks, where $T_i = (C_i, S_i, D_i)$. $C_i$ is the
computation time, $S_i$ the earliest start time and $D_i$ the deadline of $T_i$. Let $T_p$ be sorted in non-decreasing
order of start times (i.e. for any pair of tasks $T_i$ and $T_j$, if $i > j$ then $S_i \ge S_j$). $T_p$ is schedulable if and
only if:

$$\forall i \, \forall j: \quad \sum_{k=i}^{j} C_k \le D_j - S_i.$$

The lemma requires that for each task in the set $T_p$, the guarantee procedure ensures that there is
sufficient time to process the task and all guaranteed tasks that are in $T_p$ and start later. To implement
this scheduling condition requires time proportional to $n^2$, where $n$ is the number of the tasks in the NTT.
This implementation would be too costly in a dynamic real-time system. However, an incremental version
of the above schedulability condition can be implemented in time proportional to $n$. This incremental
version relies on two observations. First, the start time of the task under consideration is equal to the
current time because only aperiodic tasks can cause a call to the guarantee procedure. Second, it is
possible to construct a feasible schedule for a set of periodic tasks that are not yet ready to execute. This
observation is based on a schedulability test that is performed during the system initialization phase for
a set of preemptable periodic tasks. These observations can be incorporated to change the schedulability
condition for a set of preemptable tasks where the schedule is constructed incrementally by including one
additional task at a time.
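
For comparison with the incremental version, the non-incremental test can be sketched as follows. The sketch reads the condition of Lemma 2 as: for every pair $i \le j$, the sum of $C_k$ for $k = i, \ldots, j$ must not exceed $D_j - S_i$. The function name `schedulable_direct` and the tuple layout `(C, S, D)` are assumptions of this example.

```python
def schedulable_direct(tasks):
    """Check the pairwise condition of Lemma 2 over the whole task set.

    tasks: list of (C, S, D) tuples sorted by non-decreasing start time S.
    A running sum is kept for each starting index i, so every pair
    (i, j) with i <= j is examined exactly once.
    """
    n = len(tasks)
    for i in range(n):
        start_i = tasks[i][1]
        total = 0
        for j in range(i, n):
            total += tasks[j][0]
            if total > tasks[j][2] - start_i:
                return False
    return True
```

The nested loops examine on the order of $n^2$ pairs, which is the cost that the incremental version is meant to avoid.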

Lemma 3 Let $T_p = \{T_1, T_2, \ldots, T_n\}$ be a set of preemptable tasks, where $T_i = (C_i, S_i, D_i)$. $C_i$ is the
computation time, $S_i$ the earliest start time and $D_i$ the deadline of $T_i$. Let $T_p$ be sorted in non-decreasing
order of start times and let $T_p$ be schedulable according to Lemma 2. Then $T_p \cup \{T_0\}$ is schedulable if and only
if:

$$\forall j: \quad \sum_{k=0}^{j} C_k \le D_j - S_0.$$


This lemma states that a task can be added to a set of previously guaranteed tasks given that there is
enough processing time to guarantee the timely execution of this task and all the guaranteed tasks in the
set. Since the tasks are preemptable, there might be several tasks with start times equal to the current
time. In such cases we assume that the task being considered by the guarantee procedure precedes all other
tasks with start times equal to the current time. The costs of preemption and running the dispatcher are
ignored in this study; however, these costs can be considered as part of the task's computation time. Note
that the incremental version of the guarantee procedure is valid only for task sets where the task that is
being considered has start time equal to the present time and the only tasks that have start times in the
future are periodic tasks that satisfy the schedulability condition of Lemma 1.
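
The incremental test can be sketched in Python as a single pass over the guaranteed tasks. The sketch follows a literal reading of Lemma 3, in which the new task $T_0$ starts at the current time and the condition is that the sum of $C_k$ for $k = 0, \ldots, j$ does not exceed $D_j - S_0$ for every $j$. The name `can_guarantee` and the tuple layout `(C, S, D)` are assumptions of this example, not the study's implementation.

```python
def can_guarantee(new_task, guaranteed):
    """Incremental schedulability check for one arriving aperiodic task.

    new_task:   (C0, S0, D0), with S0 equal to the current time.
    guaranteed: list of (C, S, D) tuples, sorted by non-decreasing start
                time and already known to be schedulable on their own.
    """
    c0, s0, d0 = new_task
    total = c0
    if total > d0 - s0:          # the new task alone must fit (j = 0)
        return False
    for c, _start, d in guaranteed:
        total += c               # running sum of C_0 .. C_j
        if total > d - s0:
            return False
    return True
```

Each guaranteed task is visited once, so the work grows linearly with the number of tasks in the NTT.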

III-C  Stochastic-Learning  Global  Scheduling  Strategy 

The stochastic-learning strategy is based on a real-time extension of the stochastic-learning automata
[4, 5, 8]. The stochastic-learning global scheduler at each processor of the distributed system keeps a
variable L that is proportional to the load on that processor. L is defined as the fraction of the busy time
to the total time in the current window of scheduling. This information is broadcast periodically as part
of an update message which also contains the identification of the source of the message. The period of
the update messages is a tunable system parameter. Using the update information received from other
processors and the value of its own variable L, the global scheduler at each processor can calculate the
system's average load. It is then able to label each processor as underloaded or overloaded. This assignment
of underloaded and overloaded to each processor constitutes the state of the system as observed by each
global scheduler. Different processors of the system may observe different states of the system due to the
different delays inherent in the communication medium.

Figure 1: The operation of local and stochastic-learning global scheduling strategy for a processor of the
distributed system.
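
One way to derive an observed state from the broadcast loads is sketched below in Python. The text does not state the exact labeling rule, so the sketch assumes that a processor is labeled overloaded when its load exceeds the system average; this rule and the names `window_load` and `observed_state` are assumptions made for illustration.

```python
def window_load(busy_time, window_length):
    """Load L: fraction of busy time in the current window of scheduling."""
    return busy_time / window_length


def observed_state(loads):
    """Label each processor: True = overloaded, False = underloaded.

    loads: the most recent L value known for every processor, including
    the local one. A processor is taken to be overloaded when its load
    is above the average of all loads (an assumed rule).
    """
    average = sum(loads) / len(loads)
    return tuple(load > average for load in loads)
```

The resulting tuple can serve directly as the row index of the probability matrix, since two schedulers holding different load reports will compute different tuples.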

The probability vectors for each possible state of the system are combined to form a probability matrix
P, the rows of which correspond to the observed state of the system. An element of the probability matrix,
$P_{ij}$ at processor k, represents the probability that processor k will send an unguaranteed task to processor
j should processor k find the system in state i. The diagonal elements are set to zero to prevent a processor
from sending an unguaranteed task to itself. Furthermore, the probability of sending a task to a processor
that is overloaded is set to zero. Initially unguaranteed tasks are sent to underloaded processors with equal
probabilities.
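
The initialization and use of one row of the probability matrix can be sketched as follows. The representation of a state as a sequence of booleans and the names `initial_row` and `pick_destination` are assumptions of this example; the sampling uses `random.choices` from the Python standard library.

```python
import random


def initial_row(overloaded, self_index):
    """Build the initial probability row for one observed state.

    overloaded: sequence of booleans, one per processor (True = overloaded).
    The local processor and all overloaded processors get probability
    zero; the remaining processors share the probability equally.
    """
    n = len(overloaded)
    targets = [j for j in range(n) if j != self_index and not overloaded[j]]
    row = [0.0] * n
    for j in targets:
        row[j] = 1.0 / len(targets)
    return row


def pick_destination(row, rng=random):
    """Draw a destination processor according to the row, or None."""
    if sum(row) == 0:
        return None  # no eligible processor in this state
    return rng.choices(range(len(row)), weights=row, k=1)[0]
```

For a four-processor system observed from processor 0 with processor 1 overloaded, `initial_row([False, True, False, False], 0)` yields `[0.0, 0.0, 0.5, 0.5]`, so only processors 2 and 3 can be drawn.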

The learning occurs as the result of processors guaranteeing or rejecting remote tasks. Let us assume
that processor i's estimate of the state of the system is S and that it has sent a task to processor j. If
the task is not guaranteed, the action of sending an unguaranteed task to processor j, given processor i's
estimate of the state of the system, was an incorrect action. Therefore, the probability of i sending another
task to j should be reduced when i finds the system in the same state, S. If the task was guaranteed by
j, the probability of sending a task to processor j should be increased when i finds the system in state S.
For both cases the row of P in processor i that is updated must correspond to the observed system state,
S. Note that the source processor i sends its current observed state of the system, i.e. S, along with the
task and processor j sends a message to inform processor i whether the task was guaranteed or rejected.

If the task was rejected then the elements of the effective row, i.e. row S, are updated by the
penalty function:

$$P_m(n+1) = P_m(n) + K_p \times P_m(n), \quad m \ne j$$
$$P_j(n+1) = P_j(n) - K_p \times \sum_{m \ne j} P_m(n)$$

If the task was guaranteed then the elements of the effective row, i.e. row S, are updated by the
reward function:

$$P_m(n+1) = P_m(n) - K_r \times P_m(n), \quad m \ne j$$
$$P_j(n+1) = P_j(n) + K_r \times \sum_{m \ne j} P_m(n)$$

In the above equations n refers to the current value of the elements of the probability matrix and n + 1
to the next instance of them. Note that the above computation is performed periodically as part of the
global scheduler and that both $K_p$ and $K_r$ are tunable parameters.
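
The two update rules can be sketched in Python as follows. The sketch assumes the usual linear form of such updates: on a penalty every entry other than j grows by the factor $K_p$ and entry j loses the same total amount, and on a reward the roles are reversed with $K_r$. The function names `penalize` and `reward` are invented for this example.

```python
def penalize(row, j, k_p):
    """Penalty update after processor j rejected a task.

    Every other entry m grows by k_p * P_m; entry j shrinks by the sum
    of those increases, so the row total is unchanged.
    """
    others = sum(p for m, p in enumerate(row) if m != j)
    return [p - k_p * others if m == j else p + k_p * p
            for m, p in enumerate(row)]


def reward(row, j, k_r):
    """Reward update after processor j guaranteed a task.

    Every other entry m shrinks by k_r * P_m; entry j grows by the sum
    of those decreases, so the row total is unchanged.
    """
    others = sum(p for m, p in enumerate(row) if m != j)
    return [p + k_r * others if m == j else p - k_r * p
            for m, p in enumerate(row)]
```

Entries that are zero (the diagonal and the overloaded processors) stay zero under both rules. The sketch does not clamp its results, so a large `k_p` could drive the penalized entry below zero; how that case is handled is not described in the text.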


Figure 1 is a pictorial representation of the local and global schedulers that reside at each processor of
the distributed system. Each processor periodically, with period r, checks and updates its local L
and broadcasts it to all other processors in the system. Using this information, each processor determines
the state of the system at this time. Once the local scheduler finds a task that it cannot guarantee then
it sends that task to a remote processor using the row of the probability matrix that corresponds to the
current observed state of the system. Upon the arrival of the remote task at the destination processor,
the local scheduler of this processor attempts to guarantee the task. Rejected tasks are removed from the
system and guaranteed tasks are entered into the NTT. The indication of the guarantee or rejection is
sent back to the source processor and that processor updates the probability vector corresponding to the
observed state of the system at the time of the initial transmission.


IV  Simulation  Model  and  Results 

The simulation model, programmed in C, consists of a network of five homogeneous processors, connected
as shown in Figure 2. There are five independent sources for arrivals of aperiodic tasks. Each source is
modeled by a Poisson distribution with averages $\lambda_1, \ldots, \lambda_5$. The $\lambda$'s vary depending on the particular loads
being modeled. Specific values of the $\lambda$'s correspond to average loads which are presented with the simulation
results later on in this section. When a task arrives at a processor from the external world, it is assigned a
size according to a Poisson distribution. The average size of the tasks is another parameter that varies with
the particular load being considered. If a task cannot be guaranteed by the processor at which it arrives
from the external world, that task may be transferred to another processor of the network. The delivery
of tasks and other messages, such as updates, takes a time that depends on two factors, the queueing delay
and the transmission delay. While the queueing delay depends on the behavior of each processor of the
system, the transmission delay is a physical property of the network and is a global variable chosen to
reflect realistic situations in current networks. The queueing delay is determined by the simulation model.

Figure 2: Network Topology

In a real-time system one of the most important parameters to consider is the laxity of tasks. The laxity of
a task, $T_i$, is defined by: $\text{laxity}(T_i) = D_i - C_i - R_i$, where $D_i$, $C_i$, $R_i$ refer to the task's deadline, computation
time, and ready time respectively. The laxity of all periodic tasks in the system is known in advance.
The laxity of the aperiodic tasks is modeled as a Poisson distribution and is a system variable. In the
simulations some of the processors of the distributed system are overloaded and others are not. In none of
the cases, however, is the system as a whole overloaded.

To evaluate the performance of the stochastic-learning scheduling strategy, two deterministic and three
baseline global scheduling strategies are also implemented. The first deterministic strategy is a centralized
strategy where one processor is responsible for making decisions about relocating tasks that cannot be
guaranteed locally. The second deterministic strategy is a cooperative and distributed strategy in which
a set of processors solicit and submit bids to acquire those tasks that cannot be guaranteed locally. The
baseline algorithms consist of an optimal global scheduling strategy, a non-cooperative strategy in which
tasks must be executed locally and a random global scheduling strategy. All global scheduling strategies
use the local scheduler described in the previous section for local scheduling.

In the centralized global scheduling strategy a central processor is responsible for making global scheduling
decisions. A task that cannot be guaranteed by the local scheduler is attached to a queue of locally
unguaranteed tasks. The global scheduler at each non-central processor is responsible for dequeuing of
these tasks. The central scheduler is a periodic task itself, but to decrease the number of late tasks it may
also be invoked dynamically. The centralized global scheduling strategy is a three-phase protocol. In the
first phase of the protocol the global schedulers at non-central processors of the system inform the global
scheduler at the central processor of the system of the locally unguaranteed tasks. In the second phase
the central scheduler tries to find a suitable sink processor for each of the locally unguaranteed tasks. If a
suitable sink processor is found its identity is sent to the source processor of that unguaranteed task. The
third phase of the protocol starts when the source processor receives this message. The global scheduler
on this processor labels the task remote and sends it to the designated sink processor. On the arrival of
this remote task at the selected sink processor, if this task is guaranteed by the local scheduler, its timing
information is inserted into the appropriate row of the NTT of the sink processor and the scheduler at the
central processor is informed. If this task is not guaranteed by the local scheduler, it is removed from the
system and there is no need to inform the global scheduler at the central processor since the NTT of this
processor remains unchanged. The centralized global scheduling strategy uses processor 2 as the central
processor in the network of Figure 2. Note that this is the best possible location for the central processor,
as only one processor is not its immediate neighbor.

The cooperative strategy is an implementation of the bidding strategy introduced by Ramamritham
and Stankovic [6, 9]. A new task that arrives into the system is considered first by the local scheduler.
If that task is not guaranteed by the local scheduler, it is attached to a queue of unguaranteed tasks.
The global scheduler is itself a periodic task but it may also be invoked dynamically. The cooperative
global scheduler also has three phases. During the first phase the global scheduler at the source processor
generates and broadcasts a Request For Bids (RFB) message. In the second phase the global schedulers
of other processors of the system receive the RFB message and may generate and send a bid to the source
processor. In the third phase, the global scheduler at the source processor evaluates the bids and awards
the task to the highest bidder. Once a remote task has been received by the sink processor, the local
scheduler of the sink processor runs the guarantee procedure for this newly arrived remote task. If the task
is guaranteed by the local scheduler the task is placed in the appropriate row of the NTT. If a remote task
cannot be guaranteed by the local scheduler, it is rejected and removed from the system.

The first baseline strategy is an optimal strategy which assumes tasks can be processed at remote
processors with zero communication and transmission delays. This strategy effectively replaces a homogeneous
distributed system of N processors with a single processor which is N times faster than a processor of the
distributed system. The second baseline strategy is a non-cooperative strategy. In this strategy tasks that
cannot be guaranteed for local execution are rejected and removed from the system. The last baseline
strategy implemented is a random global scheduling strategy which sends locally unguaranteed tasks to
another processor selected at random from the distributed system. While the optimal global scheduling
strategy defines the upper bound on the number of tasks that can be guaranteed in the system, the random
and non-cooperative strategies define the lower bounds for the number of tasks that should be guaranteed
by a global scheduling strategy.

A number of different metrics are measured in this study. The result reported, however, is only with
respect to one metric, the percentage of the tasks that complete their execution prior to their assigned
deadline. Since all periodic tasks are guaranteed to meet their timing requirements by the local scheduler,
only the results for aperiodic tasks are considered. Each global scheduling strategy has a number of tunable
parameters. All such parameters are optimized with respect to the normal operation of the network
where none of the processors of the distributed system are highly overloaded. Once these parameters are
optimized, their values are not changed.

Figure 3 shows the proportion of tasks guaranteed as a function of the average load of the system when
the queuing delays are taken into account. Figure 4 shows the results in an unrealistic model in which
queuing delays are ignored. As can be seen from these two figures, the queuing delay has a substantial effect
on the guarantee ratio and the communication medium is a significant part of the network that cannot
be ignored in the design of distributed real-time scheduling algorithms. Furthermore, Figure 3 shows that
the performance of the stochastic-learning strategy is superior to that of other algorithms for moderate
to heavy loads. This phenomenon is not observed in the unrealistic model of Figure 4, again stressing the
importance of the communication medium and its real-time impacts.

In Figure 3 also note that the centralized scheme does not compare favorably with either the cooperative
or the stochastic strategies. The stochastic scheduling strategy requires no messages to relocate
tasks. This reduces the delay in receipt of a task, thereby increasing the probability of guaranteeing such
tasks. Both the centralized and the cooperative strategies require two messages before they can send a
task to a remote processor. The cooperative scheme has, however, more up-to-date information and on
the average achieves better results. Furthermore, the central processor of the system may also become
congested, causing delay to request and update messages.

The stochastic-learning strategy lacks the up-to-date information of the cooperative strategy. For each
locally unguaranteed task, however, the cooperative strategy sends two messages whereas the stochastic-
learning strategy sends only one. The extra message increases the traffic of the network. Furthermore, the message sent
by the stochastic scheduler is sent after the remote task has arrived at its destination, whereas the cooperative
scheduler requires two messages before it can send a task to a remote processor. The effect of these two
factors is more pronounced for networks that have higher transmission delays or systems where the mean
laxity of real-time tasks is smaller. Figure 5 shows the proportion of tasks guaranteed as a function
of the average load of the system when the transmission delay is increased to 10 milliseconds per packet
per hop. Each packet is 1000 bits long and the mean task size is 10000 bits. Comparing Figures 5 and 3
shows that as transmission delay increases, the stochastic-learning scheduling strategy becomes more attractive
compared with other strategies. Similar results can be deduced by comparing Figures 6 and 3. Once
again, the stochastic-learning strategy is found superior to all other global scheduling strategies as the mean laxity
of tasks is reduced from 1450 milliseconds to 500 milliseconds. Similar results are found when the size of
tasks or the average load of the network is increased.
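To see why task transfers weigh so heavily in these experiments, the quoted parameters can be turned into a rough transfer time. This is only a sketch: it assumes that the packets of a task are sent one after another over a single hop with no pipelining, and it reads the mean task size as 10000 bits. Neither assumption is stated in the text, and the function name is hypothetical.

```python
import math


def transfer_delay_ms(task_bits, packet_bits, delay_ms_per_packet_per_hop, hops=1):
    """Rough time to move one task across `hops` links, one packet at a time."""
    packets = math.ceil(task_bits / packet_bits)
    return packets * delay_ms_per_packet_per_hop * hops


# Parameters of the Figure 5 experiment as read from the text (see assumptions above).
delay = transfer_delay_ms(task_bits=10_000, packet_bits=1_000,
                          delay_ms_per_packet_per_hop=10, hops=1)

# Compare the transfer time with the two mean laxities used in the study.
for laxity_ms in (1450, 500):
    print(f"mean laxity {laxity_ms} ms: one-hop transfer takes {delay / laxity_ms:.0%} of it")
```

Under these assumptions a single hop already consumes a noticeable share of the smaller laxity, before any request or update messages are counted, which is in line with the trend reported for Figures 5 and 6.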

The results suggest some general conclusions about the effectiveness of stochastic and cooperative
scheduling strategies. The stochastic scheduling strategy performs better than the cooperative strategy
when the transmission delays are longer, the system is more heavily loaded, the laxity of tasks is smaller,
or the task sizes are larger. Thus, stochastic scheduling is more suitable for difficult systems and it
can be used in new applications where traditional deterministic schemes are not successful. On the other
hand, the deterministic strategy is more suitable for easy systems that are not heavily loaded, where tasks have
more laxity, job sizes are smaller, or transmission delays are not significant. It is important to note that
there are situations where no global scheduling strategy performs well. For example, if the size of tasks is
too large, the communication delay will be too great regardless of the scheduling algorithm selected. It is
also interesting to note that there are situations where any algorithm would be adequate. These include
the trivial cases when none of the processors of the distributed system is overloaded and the case when
transmission delays are exceedingly small.

V Conclusions and Future Research


This paper presents a novel dynamic scheduling strategy for real-time distributed systems. The problem
of scheduling real-time tasks in a distributed system is decomposed into two related subproblems. First, a
local scheduler attempts to guarantee the timely execution of real-time tasks by considering the current load
on a single processor. Second, tasks that cannot be guaranteed locally are sent to another processor of the
distributed system that has a high probability of executing remote tasks in time. The stochastic strategy
learns the best possible action for each state of the network. Periodic messages are broadcast and collected
to determine the state of the network. The stochastic-learning scheduling strategy is compared with several
other global scheduling strategies. The simulation results demonstrate that the stochastic-learning strategy
is superior to deterministic strategies in many realistic situations, and they also indicate that stochastic
strategies extend the domain of real-time distributed controllers by successfully scheduling tasks in difficult
situations. In the future this research will consider heterogeneous networks and precedence and mutual
exclusion relations among the real-time tasks.

Figure 3: Proportion of tasks guaranteed as a function of mean load on each node of the distributed
system. In this figure the queueing due to the congestion of the links is modeled. Note that the optimal
algorithm does not incur any form of communication delay and hence does not represent a realistic
network.

Figure 4: Proportion of tasks guaranteed as a function of mean load on each node of the distributed
system. In this figure the queueing is ignored.

Figure 5: Proportion of tasks guaranteed as a function of mean load on each node of the distributed
system. In this figure the transmission delay is increased to 10 milliseconds per packet per hop.

Figure 6: Proportion of tasks guaranteed as a function of mean load on each node of the distributed
system. In this figure the mean laxity of tasks is reduced to 500 milliseconds.

1. MIM: Conceptual decomposition

When a mathematical model is translated to a detailed structural
model for implementation, communication considerations often
outweigh the computational considerations. Such a structural model is
referred to as a Massively Interconnected Model (MIM). Numerous
examples exist in the areas of parallel computation, distributed computer
communication networks, and artificial neural networks. The beam
forming problem in acoustic array processing is one kind of Massively
Interconnected Model, and it challenges designers who hope to exploit
the parallel processing technology of the future.

Beam forming is a spatial filtering problem. Signals come in
from thousands of acoustic sensors placed
over a spatially limited (2D) area. These signals are processed and
enhanced coherently so that the directivity of the arriving beam can be
discriminated in both the azimuth and elevation directions. The signals
are transformed and manipulated in a mathematical model that
requires MIM for implementation. The beam forming problem is selected
here to demonstrate a procedure for describing typical MIM
problems. It is also used to develop characterizations that can
differentiate among the available architectures that can fit the MIM
implementation.

In this paper, the beam forming problem is first considered as a
mathematical model. At this stage abstract mathematical formulas and
notation are used to describe the algorithm. Later, to implement this
algorithm, a practical architecture is selected to accommodate the
processing. The architecture may be VLSI chips, bit-slice
micro-programmable processors, multiple processor systems using DSP chips or
Floating Point Unit (FPU) hardware, SIMD machines such as the CM-2, or MIMD
machines such as the Hypercube or iWarp. Modeling using MIM is an
intermediate step in which alternative partitions of the algorithm are
studied. The vital issues in this step are the resource allocation and
scheduling problems that occur when MIM modules are fitted to the architectures.

Figure 1. Design procedures of a beam former.


Therefore, MIM modules are not considered as direct hardware
architectural components but as a conceptual partition of the algorithm.
This decomposition into MIM modules results in resource allocation and
scheduling problems.

Since the objective is to implement a signal processing algorithm
(beam forming) in hardware in the most economical way, counting
the whole life cycle cost (hardware/software design cost and
maintainability costs), it is necessary to develop measures that can help
to answer the following questions:

1.  What  are  the  capacity  requirements  in  the  DSP  algorithm? 

2.  Among  all  the  architectures,  such  as  VLSI,  bit-slice,  DSP  chips, 
SIMD  machines,  and  MIMD  machines,  which  architecture  can  fit 
the  beam  former  best?  This  is  a  complicated  issue.  The  MIM 
module  decomposition  is  used  to  resolve  this  question. 

3.  How  do  you  decompose  the  algorithm  into  MIM  modules?  The 
decomposition  depends  very  much  on  the  architecture  under 
consideration.  How  are  these  considerations  related? 

2.  Beam  Forming  Algorithm 

Let's assume there is a source in the far field radiating a planar wave
at an angle of $\theta$ with respect to the sensor array shown in Figure 2. The
signal received at sensor 1, $x_1(t)$, is a delayed version of the signals $x_2(t)$,
$x_3(t)$, ..., $x_M(t)$ of the other sensors. Let the delay in time be $\Delta$:

$$\Delta = d \cos\theta / \lambda, \tag{1}$$

where $\lambda$ is the wavelength of the source and $d$ is the distance between
sensors. Assume that we are dealing with an evenly spaced linear array of
sensors.

Figure 2. Linear uniformly spaced sensor array.

With $x_0(t)$ taken as the undelayed reference signal,

$$
\begin{aligned}
x_1(t) &= x_0(t-\Delta) \\
x_2(t) &= x_1(t-\Delta) = x_0(t-2\Delta) \\
&\;\;\vdots \\
x_M(t) &= x_{M-1}(t-\Delta) = \cdots = x_0(t-M\Delta)
\end{aligned}
\tag{2}
$$

If we consider the Fourier Transform of the sensor signals,

$$
\begin{aligned}
X_1(f) &= X_0(f)\, e^{-j2\pi f \Delta} \\
X_2(f) &= X_0(f)\, e^{-j2\pi f\, 2\Delta} \\
&\;\;\vdots \\
X_M(f) &= X_0(f)\, e^{-j2\pi f M \Delta}
\end{aligned}
$$

or

$$X_n(f) = X_0(f)\, e^{-j2\pi f n \Delta} \tag{3}$$

The broadband acoustic beam forming procedure considered here is
basically a delay-sum approach. There are two possible approaches to
broadband beam forming: the time domain approach and the frequency
domain approach. Because of its easier computation requirements and its
applicability to optimal and adaptive beam forming, the frequency domain
approach is considered here. The signal flow diagram for a beam former is
shown in Figure 3. For each frequency a set of weights $W_n(f)$ is used in the
algorithm.


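$$Y(f) = \sum_{n=1}^{M} a_n X_n(f)\, W_n^{*}(f) \tag{4}$$

Select

$$W_n(f) = e^{-j2\pi f n \Delta}. \tag{5}$$

Equation (4) becomes

$$
\begin{aligned}
Y(f) &= \sum_{n=1}^{M} a_n \left[ X_0(f)\, e^{-j2\pi f n \Delta} \right] W_n^{*}(f) \\
     &= \sum_{n=1}^{M} a_n X_0(f)\, e^{-j2\pi f n \Delta}\, e^{+j2\pi f n \Delta} \\
     &= \sum_{n=1}^{M} a_n X_0(f) = X_0(f) \sum_{n=1}^{M} a_n
\end{aligned}
$$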


It is obvious that this special selection of $W_n(f)$ results in the coherent
summation of the beam former output, $Y(f)$. $W_n(f)$ is called the
steering vector; it depends on the angle of arrival $\theta$ through equations (5)
and (1). In order to hear the beam from different directions, the whole
operation in Figure 3 is repeated for each direction angle. The steering
vector in reality also depends on the source frequency, as shown in
equation (5). Therefore, even though the direction of the beam is fixed,
a different set of $W_n(f)$ has to be provided for each temporal
frequency in the broadband spectrum. If audio output is required, the
spectrum signal $Y(f)$ will be inversely transformed into the time domain.

Figure 3. Frequency domain approach for beam former.
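The coherent summation can be checked numerically for a single frequency bin. The following is a minimal sketch, not the authors' implementation: it assumes the steering weights are applied as complex conjugates so that the per-sensor phase cancels, and the function names and numeric values are hypothetical.

```python
import cmath
import math


def sensor_spectra(x0, f, delta, m):
    """X_n(f) = X_0(f) * exp(-j*2*pi*f*n*delta) for n = 1..m, as in equation (3)."""
    return [x0 * cmath.exp(-2j * math.pi * f * n * delta) for n in range(1, m + 1)]


def steering_vector(f, delta, m):
    """W_n(f) = exp(-j*2*pi*f*n*delta); the sign convention is an assumption."""
    return [cmath.exp(-2j * math.pi * f * n * delta) for n in range(1, m + 1)]


def beam_output(spectra, weights, gains):
    """Y(f) = sum over n of a_n * X_n(f) * conj(W_n(f)), as in equation (4)."""
    return sum(a * x * w.conjugate() for a, x, w in zip(gains, spectra, weights))


# Illustrative values only: 8 sensors, one 100 Hz bin, unit sensor gains.
m, f, delta = 8, 100.0, 0.0013
x0 = 2.0 + 1.0j
gains = [1.0] * m
spectra = sensor_spectra(x0, f, delta, m)

# Steering at the true delay sums coherently; a wrong delay does not.
matched = beam_output(spectra, steering_vector(f, delta, m), gains)
mismatched = beam_output(spectra, steering_vector(f, 0.004, m), gains)
print(abs(matched), abs(x0) * sum(gains))
print(abs(mismatched))
```

Repeating `beam_output` for every steering delay and every frequency bin is what drives the dot-product cost of the beam former.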

3.  To  Answer  the  Obvious 

Many people know the answer to the question of whether the beam
former is a computation-intensive operation. Here a set of measures is
developed to answer this obvious question. The computation requirements
of a beam former are determined from the mathematical model developed in
the previous section. The model shown in Figure 3 encompasses the
equations and English description of the previous section. It is
mathematical, but not rigorous and rigid from the point of view of
computer language syntax. In order to show that it is a computation-intensive
operation, we use the following measures.

M1: Computation Bandwidth (BW) requirement: A real-time beam
former receives acoustic signals, processes the data, and provides beam
information y(t) for a specific frequency to subsequent systems. Because of
the hard real-time delay bounds on processing, it is necessary to consider
processing capacity per unit time. The number of operations per unit time is
characterized in terms of the computational BW requirement
(MFLOPS).

M2: Communication Bandwidth (BW) requirement: The beam former
typically processes data from sensors and steering vectors from a database.
The hard real-time delay has to be satisfied. These needs are specified as I/O
bandwidth (BW) requirements (bytes/sec). The beam former algorithm
requires certain capabilities, and whether an architecture can fulfill that
requirement is an important question.

M3: Memory Bandwidth Requirement: Very often, in order to speed up
operation, some of the coefficients can be precalculated and stored in
memory. Sometimes the memory reads and writes are so frequent and
intensive that the total run time slows down. In real-time applications it is
also imperative to characterize this impact in terms of a memory
bandwidth requirement.

4.  Beam  Former?  Be  Specific!

A beam former (BF) problem is discussed here. The requirements of
this beam former depend very much on the size of the array and the
number of beams involved. It is essential to be specific when analyzing a beam
former problem. A Passive Sonar Example Draft (version 0.02) created for
the Design and Synthesis Technology Project at the Naval Surface Warfare
Center is used here as a guideline. This example allows fixed beam
forming, steerable beam forming, and acoustic channel output.
Environmental and seasonal data can also influence the operation of the
BF. To demonstrate the idea of MIM decomposition in this study, the
draft system has been simplified as follows.

Very  Simplified  Beam  Former:

1.  Only  fixed  beam  forming  is  included. 

2.  A  1-D  linear  uniformly  spaced  array  is  considered. 

3.  The  total  number  of  sensors  is  100.

4.  Frequency  domain  approach  is  adopted. 

5.  Azimuthal  coverage  of  360°  with  3°  resolution. 

6.  Frequency  coverage  is  from  0-256  Hz. 

The coefficient multiplication with $a_n$ in equation (4) is sensor
dependent. This multiplication is part of the signal conditioning
operation. It is done in analog electronics. Therefore, it is not considered
as part of the beam former. There are two basic kinds of operations in a
beam former. The first operation is the discrete Fourier transform using
Fast Fourier Transform (FFT) algorithms. To cover a spectrum up to 256 Hz,
the Nyquist rate requires a sampling frequency of
$f_s = 1/T = 512$ Hz. Let's assume that an N-point FFT is involved in this
system. The total rate for doing the FFT is

$$R_{FFT} = \frac{N_s \cdot \tfrac{1}{2} N \log_2 N}{N T} = \tfrac{1}{2} N_s \log_2 N \cdot f_s$$

where $N_s$ is the number of sensor elements, i.e. $N_s = 100$,
$N = 512$ and $T = 1/512$ sec.

Because real data is transformed, the Hermitian property spares us from
calculating negative frequency components. Therefore, the total
calculation is $\tfrac{1}{2} N \log_2 N$ within $N T$ sec for each sensor signal. For our
very simplified BF,

$$R_{FFT} = \tfrac{1}{2} \cdot 100 \cdot 9 \cdot 512 = 230.4 \text{ KFLOPS}$$

The second operation is the dot product calculation shown in equation
(4). There are $N_s$ multiply-add operations for each beam
direction. Since the steering vector also depends on the frequency, with N
frequencies the rate for the vector product is

$$R_{DP} = \frac{N_b N_s N}{N T} = N_b N_s \cdot f_s$$

where $N_b$ is the number of beams, i.e. $N_b = 120$. For the very simplified BF,

$$R_{DP} = 120 \cdot 100 \cdot 512 = 6.144 \text{ MFLOPS}$$

The total computational bandwidth of the BF is

$$\text{CompBW} = R_{FFT} + R_{DP} = 6.385 \text{ MFLOPS}$$

The beam former is one part of the total system. Its effectiveness
depends on the input/output it can provide. The input bandwidth
requirement comes from accepting the signals from the conditioner:

$$\text{I/P BW} = N_s \cdot f_s \cdot 2 \text{ bytes/sample} = 100 \cdot 512 \cdot 2 = 102.4 \text{ Kbytes/sec}$$

where each sample from the A/D converter is a 12-bit real value, which takes
2 bytes. The output is fed to the tracker process of the simplified sonar
system. Each beam datum has 4 bytes per sample:

$$\text{O/P BW} = N_b \cdot f_s \cdot 4 \text{ bytes/sample} = 120 \cdot 512 \cdot 4 = 245.76 \text{ Kbytes/sec}$$

While the dot product is in operation, it is necessary to read out the
steering vectors of equation (5) in real time. The memory bandwidth
requirement is

$$\text{MEMBW} = N_b \cdot N_s \cdot f_s \cdot 4 \text{ bytes/output} = 120 \cdot 100 \cdot 512 \cdot 4 = 24.576 \text{ Mbytes/sec}$$
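The capacity figures above follow directly from the six parameters of the very simplified beam former, and can be recomputed in a few lines. This is a sketch for checking the arithmetic only; the variable names are ours. Summing the two quoted rates gives about 6.374 MFLOPS, slightly below the 6.385 MFLOPS quoted for the total, so the sketch reports the computed sum.

```python
import math

N_S = 100        # number of sensors
N = 512          # FFT length
F_S = 512        # sampling frequency in Hz (Nyquist rate for 0-256 Hz)
N_B = 360 // 3   # beams: 360 degrees at 3-degree resolution = 120

# M1: computational bandwidth, in FLOPS
r_fft = 0.5 * N_S * math.log2(N) * F_S   # real-input FFTs for all sensors
r_dp = N_B * N_S * F_S                   # dot products for all beams
comp_bw = r_fft + r_dp

# M2: communication bandwidth, in bytes/sec
ip_bw = N_S * F_S * 2                    # 2 bytes per 12-bit input sample
op_bw = N_B * F_S * 4                    # 4 bytes per beam output sample

# M3: memory bandwidth for reading steering vectors, in bytes/sec
mem_bw = N_B * N_S * F_S * 4

print(f"R_FFT  = {r_fft / 1e3:.1f} KFLOPS")
print(f"R_DP   = {r_dp / 1e6:.3f} MFLOPS")
print(f"CompBW = {comp_bw / 1e6:.4f} MFLOPS")
print(f"I/P BW = {ip_bw / 1e3:.1f} Kbytes/sec")
print(f"O/P BW = {op_bw / 1e3:.2f} Kbytes/sec")
print(f"MEMBW  = {mem_bw / 1e6:.3f} Mbytes/sec")
```

Because every term scales with the sensor count, the beam count, or both, a beam former a hundred times larger pushes the memory and computation figures up by roughly the same factor.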

In  summary,  the  mathematical  model  can  be  represented  as  a  VHDL  entity 
with  its  capacity  requirement  as  shown  in  Figure  4. 


Figure 4. Mathematical model of beam former with capacity requirement.
(Block diagram labels: I/P BW = 102.4 Kbytes/sec; Computational BW = 6.385 MFLOPS;
O/P BW = 245.76 Kbytes/sec; MEMBW = 24.576 Mbytes/sec.)

5.  Peak  rate  and  sustainable  rate


At first look, the capacity requirement shown in Figure 4 does
not seem to be stringent. Keep in mind that the very simplified beam
former is a trivial example. A realistic beam former has much larger
capacity requirements than these. Even for such a small task, the
bandwidth described so far is the sustainable bandwidth, not the peak
bandwidth. Many commercially available processors or DSP chips claim a
capacity of this order of magnitude, but their claims are universally
peak bandwidth. This means that under the most ideal conditions, without delay of
data or instructions, the processor can achieve, for example, 10
MFLOPS. The overall sustainable processing rate depends very much
on the communication of data and instructions in the system,
and is much smaller than the peak rate. Consequently, this is a MIM
type of situation, where the communication consideration outweighs the
computational considerations in the system. Exactly how communication
affects the computation depends on the type of algorithms involved. Some
commercial systems can achieve a very high CPU processing rate (FLOPS),
but their sustainable rate for other types of jobs is very poor. As far as beam
forming is concerned, the sustainable rate of a potential candidate
implementation or architecture is of primary interest.

7.  Measures  characterizing  architectures

Before we attempt to address the question of how to partition the
algorithm into MIM modules, it is necessary to characterize the MIM
modules to see whether they can be accommodated in a particular
architecture. In addition to the bandwidth requirements developed
previously, it is also essential to consider the following measures.

M4: FLOPS-I/O ratio ($\alpha$): This is a measure to characterize the
proportion of computation done versus communication (I/O) required in the
partition. It can be used to describe the MIM module requirement. It can
also be used to characterize the architecture element. If the peak FLOPS-I/O
ratio ($\alpha$) of an architecture element is less than the MIM module
requirement, it is possible to fit the MIM module into the element. If the
peak FLOPS-I/O ratio ($\alpha$) of an element is greater than the MIM
requirement, a problem exists in using this architecture element.

M5: Latency-FLOPS product ($\beta$): This is a measure to compare
different decompositions of MIM modules for architectural elements.
Generally, an element can be tuned to achieve a high FLOPS rate, but the
latency associated with the data stream is also increased. Consequently,
it is necessary to compare the latency-FLOPS product among different
architecture elements. For a particular MIM module partition it is also
necessary to calculate the latency-FLOPS product. Only elements with a
smaller latency-FLOPS product can accommodate a MIM module with a larger
latency-FLOPS product.

MIM modules with a small FLOPS-I/O ratio and latency-FLOPS product are
generally referred to as fine grain tasks. On the other hand, architecture
elements usually have a limited achievable FLOPS-I/O ratio and
latency-FLOPS product. Fine grain MIM modules can only be accommodated in fine
grain architecture elements. Essentially, the FLOPS-I/O ratio and the
latency-FLOPS product are measures to characterize computational
activities relative to communication activities. With these measures it
will be easier to analyze the results of different mapping approaches.
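The two measures and the fitting rule can be written down compactly. The sketch below follows the rule as worded in the text (an element fits when its ratio and product do not exceed those of the module). Treating equality as a fit, the 1-second module latency, and both architecture elements are assumptions made for illustration. The 2,304 FLOPS and 3,072 bytes/sec figures are those derived for the FFT MIM module in the bit-slice section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measures:
    """Capacity figures for a MIM module or an architecture element."""
    flops: float           # computation rate, FLOPS
    io_bytes_per_s: float  # communication (I/O) rate, bytes/sec
    latency_s: float       # data-stream latency, seconds

    @property
    def alpha(self) -> float:
        """M4: FLOPS-I/O ratio."""
        return self.flops / self.io_bytes_per_s

    @property
    def beta(self) -> float:
        """M5: latency-FLOPS product."""
        return self.latency_s * self.flops


def fits(module: Measures, element: Measures) -> bool:
    """Element alpha and beta must not exceed the module's (equality assumed OK)."""
    return module.alpha >= element.alpha and module.beta >= element.beta


# FFT MIM module; the 1-second latency (one N*T block) is an assumption.
fft_module = Measures(flops=2_304, io_bytes_per_s=3_072, latency_s=1.0)

# Two made-up architecture elements, for illustration only.
fine_element = Measures(flops=4_000, io_bytes_per_s=8_000, latency_s=0.25)
coarse_element = Measures(flops=230_400, io_bytes_per_s=30_000, latency_s=0.01)

print(fft_module.alpha)
print(fits(fft_module, fine_element), fits(fft_module, coarse_element))
```

In this reading, the coarse element fails because it offers far more computation per byte of I/O than the fine-grain FFT module can use, so its communication path becomes the limit.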

8.  MIM  decompositions 

Let's assume that a real beam former is probably a hundred times
bigger than the very simplified example considered here. Realistically, it
is not possible to use a single high performance processor to accommodate
such a scaled-up version of the capacity requirement shown in Figure 4. Either specialized hardware,
multiple processors, or parallel processor systems have to be used to
accommodate the problem. The issue is how to partition the BF
algorithm and map it to a specific architecture.

Up to now, the possible known implementations of a beam former can
be summarized as in Table 1. Most beam formers are implemented with
bit-slice microprocessors together with FFT and FPU chips [1]. Communication is done
point-to-point through high-speed, short run-length buses. The
architecture is arranged by the designer. There is some
microprogramming involved; it is small and serves only to control
the beam former. With the FFT chips now available, such as the TRW2310,
HDSP66110, and UT69532, multiple-bus structures using this kind of chip
have been developed for beam forming. Real-time operation of this kind of
system has been demonstrated. Beam forming has also been done on a Single
Instruction Multiple Data (SIMD) parallel system such as the CM-2 [2]. Even
for non-real-time prototypes, the programming work is not easy. Other
kinds of parallel systems, such as Multiple Instruction Multiple Data
(MIMD) machines, were used to prototype a beam former [4]. EMSP is one of the Navy
systems tried over the years. Programming and communication scheduling
are important issues in this kind of system. The overall objective is to
develop a low cost system, counting both hardware and software costs over the
life cycle of the system. More work needs to be done to demonstrate these
new technology implementations.

For an algorithm, the MIM decomposition depends on the mathematical
model and the available architectural elements. To do a good
MIM decomposition in Figure 1, it is necessary to consider both the left hand side
algorithm and the right hand side architectural element. For the bit-slice
processor implementation, you will probably decompose the mathematical
model into MIM modules along the lines of the signal flow graph
shown in Figure 3. Basically the beam former will have the following
architectural elements.

(a)  FFT  chips  for  DFT  algorithm. 

(b)  FPU  chip  for  floating  point  arithmetic. 

(c)  Point-to-point  high  speed  bus  for  communication.

(d)  Bit-slice  microprocessor  to  implement  control  sequences. 


If you are considering mesh-connected iWarps for the beam former
implementation, a different approach to MIM decomposition may be used.
An iWarp processor is capable of more than a simple FFT job. The MIM
module can be a combination of FFTs, floating point multiplies and additions
together. The crucial issue is what the grain size of the MIM
module will be, and whether it can be fitted into an iWarp architectural element. The
approach is to calculate the measures discussed previously for both the
MIM modules and the iWarp processor. Fitting is then based on comparing
the measures of the MIM modules with those of the iWarp processor.

9.  MIM  modules  for  bit-slice  implementation

Consider the FFT module for a bit-slice implementation shown in
Figure 3. The total computational bandwidth required is

$$\text{CompBW} = \frac{\tfrac{1}{2} N \log_2 N}{N T} = \tfrac{1}{2} \log_2 N \cdot f_s = 2{,}304 \text{ FLOPS}$$

This requirement is equivalent to doing one 512-point FFT per
second. In reality an FFT module on a board can perform a 512-point FFT in
10 msec continuously. It means that, if only computation BW is considered,
an FFT board module can accommodate 100 MIM modules.

In reality, it is very important to consider the communication
bandwidth requirement as well. For the FFT MIM module in Figure 3, the input
and output bandwidth requirements are as follows:

$$
\begin{aligned}
\text{I/P-BW} &= 512 \cdot 2 \text{ bytes/sec} = 1{,}024 \text{ bytes/sec} \\
\text{O/P-BW} &= 512 \cdot 4 \text{ bytes/sec} = 2{,}048 \text{ bytes/sec} \\
\text{CommBW} &= \text{I/P-BW} + \text{O/P-BW} = 3{,}072 \text{ bytes/sec}
\end{aligned}
$$

For an array processor board such as the PL2500 from Eighteen Eight
Laboratory, only the peak bandwidth on an AT bus was known, 3.3
Mbytes/sec. Considering the sustainable rate and all of the communication
overhead, 1% of the peak rate was chosen as a nominal rate, 30 Kbytes/sec.
Consequently, an FFT hardware board can only accommodate 10 FFT MIM
modules. This situation is described in Figure 5. In reality, both
computational bandwidth and communication bandwidth have to be considered.
The FFT board can only accommodate 10 channels of FFT MIM modules so
far.
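The two module counts can be reproduced from the figures quoted in this section. In this sketch (variable names are ours), the board is assumed to sustain back-to-back 512-point FFTs every 10 msec, and the nominal bus rate is taken as the rounded 30 Kbytes/sec used in the text.

```python
import math

N = 512      # FFT length
F_S = 512    # sampling frequency, Hz

# Requirements of one FFT MIM module (one sensor channel)
module_comp = 0.5 * math.log2(N) * F_S   # 2,304 FLOPS
module_comm = F_S * 2 + F_S * 4          # 1,024 in + 2,048 out = 3,072 bytes/sec

# Capacity of one FFT board, as quoted in the text
fft_time_s = 0.010                                   # one 512-point FFT in 10 msec
board_comp = (0.5 * N * math.log2(N)) / fft_time_s   # about 230.4 KFLOPS
board_comm = 30_000                                  # nominal bytes/sec (about 1% of peak)

by_compute = board_comp / module_comp   # modules per board, computation only
by_comm = board_comm / module_comm      # modules per board, communication only
modules_per_board = min(by_compute, by_comm)

print(f"compute-limited: {by_compute:.0f} modules")
print(f"comm-limited:    {by_comm:.2f} modules")
print(f"binding limit:   {modules_per_board:.2f} modules")
```

The communication limit works out to just under 10 modules, which the text rounds to 10, an order of magnitude below the computation-only estimate.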


Figure 5. Mapping of MIM to architectural elements with
partial consideration of measures.
(Diagram labels: each FFT MIM module requires CompBW = 2,304 FLOPS and
CommBW = 3,072 bytes/sec. Considering CompBW only, each architectural FFT board,
with CompBW = 100 x 2,304 FLOPS = 230.4 KFLOPS, accommodates 100 MIM modules.
Considering CommBW = 30 Kbytes/sec, it accommodates 10 MIM modules.)


10.  Conclusions 

The algorithm and characteristics of a beam former were introduced
and described. Due to the nature of a beam former, communication
considerations outweigh the computational considerations in this problem.
It is a typical Massively Interconnected Model (MIM) type of problem.
A set of capacity measures in terms of bandwidth is developed here.
These measures are used in the mapping process from mathematical
algorithms to MIM modules and from MIM modules to architectural
elements. Because of limited space, only the bit-slice
approach to a beam former is presented, in a simple analysis using the
capacity measures. It shows that communication considerations are the
main factor in all fitting/mapping problems for a beam former.

Abstract

We present two software applications and develop models for them. The first application
considers a producer-consumer tasking system with an intermediate buffer task and studies
how the performance is affected by different selection policies when multiple tasks are ready
to synchronize. The second application studies the reliability of a fault-tolerant software
system using the recovery block scheme. The model is incrementally augmented by considering
clustered failures or the effective arrival rate of inputs to the system.

We use stochastic reward nets, a variant of stochastic Petri nets, to model the two software
applications. In both models, each quantity to be computed is defined in terms of either the
expected value of a reward rate in steady-state or at a given time $\theta$, or as the expected value of
the accumulated reward until absorption or until a given time $\theta$. This allows extreme flexibility
while maintaining a rigorous formalization of these quantities.

List  of  Symbols
greek  pi  (lower  case) 

greek  pi  (upper  case) 

greek  lambda 

greek  mu 

greek  nu 

greek  delta 

greek  theta 

greek  tau

greek  rho 

greek  sigma 

greek  alpha 

greek  beta 

greek  xi 

w  (lower  case) 

empty  set,  similar  to  a  zero  with  a  slash  through  it 

strange  R,  for  real  numbers 

strange  N,  for  natural  numbers 

pound  sign 

infinity  sign

forall 

membership  in  a  set 

D  sup  minus,  D  sup  plus,  D  sup  small  circle

1  Introduction


Many applications demand high performance and reliability/availability from computer
systems. Higher levels of integration and newer techniques in VLSI design have made hardware
with high performance and reliability relatively inexpensive. Software, on the other hand, is
becoming a major component in the overall cost of these systems [27]. Often, though, the
software poses performance and reliability bottlenecks which should be discovered and eliminated.
Improvements in software assessment methods for the design phase of the software life cycle
are required to minimize costly redesigns and changes due to unanticipated performance or
reliability problems.

Markov models have been used for software performance assessment [10], software reliability
assessment [17], and for analyzing software fault-tolerance [9, 14, 21]. Markov models have
been quite popular in hardware performance models and hardware reliability models as well.
Reasons for the popularity of Markov models include the ability to capture various
dependencies, the equal ease with which steady-state, transient, and cumulative transient measures
can be computed, and the extension to Markov reward models useful in performability analysis.
The main drawbacks of Markov models include the size of the state space and the assumption
of exponentially distributed sojourn times. It is possible to remove the assumption of
exponential sojourn time distributions by using phase-type expansions of non-exponential
distributions [13, 29]. This method converts a non-Markovian problem into a Markovian one with
an even larger state space.

Stochastic Petri nets (SPNs) can be used to specify the problem in a concise fashion and
the underlying Markov chain can then be generated automatically. Algorithms for storing and
efficiently solving relatively large Markov chains have emerged and have been implemented in
several packages [6, 8, 23]. Our version of SPNs, called stochastic reward nets (SRNs), not only
allows the compact specification of large Markov models but also permits the concise
specification of reward structure at the net level. In this way, automatic generation of large
Markov reward models is facilitated. Steady-state, transient, and cumulative transient measures
of the resulting Markov reward models can be computed [8]. We illustrate our approach with
two examples: software performance assessment of a producer-consumer system and reliability
assessment of the recovery block, a software fault-tolerance scheme.

As  we  will  show,  detailed  behavior  of  the  system  can  be  described  concisely  and  the  effects 
of  various  design  decisions  can  be  predicted  easily.  We  note  that  SRNs  are  also  suitable 
for hardware performance, reliability, and performability analysis, hence they can be used for
combined  hardware-software  analysis.  Some  aspects  of  system  hardware  are  indeed  represented 
in  the  models  described  in  this  paper. 

Several papers are relevant to our study. Performance modeling of concurrent software has
been carried out using Markov chains [11, 12], series-parallel graphs [16], queueing networks
[28], stochastic rendezvous networks [31, 30], and SPNs [18, 5, 26]. A recent study by Leu et
al. uses SPNs to model fault-tolerant aspects of software [15].

Section 2 gives a brief review of the SRN concepts; it also contains an explanation for the
symbols used in the paper. In Section 3 we present the analysis of a producer-consumer tasking
system and in Section 4 we present the analysis of the recovery block scheme. Conclusions are
presented in Section 5.

2  Stochastic  Reward  Nets 

There are several definitions for Petri nets [19, 20] and even more for stochastic Petri nets. Our
SRN formalism allows only exponentially distributed or constant zero times, so its underlying
stochastic process is independent semi-Markov with either exponentially distributed or constant
zero holding times. We assume that the semi-Markov process is regular, that is, the number
of transition firings in a finite interval of time is finite with probability one. Such a process
can then be transformed into a continuous-time Markov chain, as is done for the generalized
stochastic Petri net (GSPN) formalism [1].

The SRNs differ from the GSPNs in several key aspects. From a structural point of view,
both formalisms are equivalent to Turing machines. But the SRNs provide enabling functions,
marking-dependent arc cardinalities, a more general approach to the specification of priorities,
and the ability to decide in a marking-dependent fashion whether the firing time of a transition
is exponentially distributed or null, often resulting in more compact nets. Perhaps more
important, though, are the differences from a stochastic modeling point of view. The SRN
formalism considers the measure specification as an integral part of the model. Underlying
an SRN is an independent semi-Markov reward process with reward rates associated to the
states and reward impulses associated to the transitions between states. Our definition of SRN
explicitly includes parameters (inputs) and the specification of multiple measures (outputs).
An SRN with m inputs and n outputs defines a function from $\mathbb{R}^m$ to $\mathbb{R}^n$.

We define a non-parametric SRN as an 11-tuple
$A = \{P, T, D^-, D^+, D^\circ, e, >, \mu_0, \lambda, w, M\}$, where:

• $P = \{p_1, \ldots, p_{|P|}\}$ is a finite set of places. Each place contains a non-negative
number of tokens. The multiset describing the number of tokens in each place is called a marking.
The notation $\#(p, \mu)$ is used to indicate the number of tokens in place p in marking μ.
If the marking is clear from the context, the notation $\#(p)$ is used.

• $T = \{t_1, \ldots, t_{|T|}\}$ is a finite set of transitions ($P \cap T = \emptyset$).

• $\forall p \in P, \forall t \in T$, $D^-_{p,t} : \mathbb{N}^{|P|} \to \mathbb{N}$,
$D^+_{t,p} : \mathbb{N}^{|P|} \to \mathbb{N}$, and $D^\circ_{p,t} : \mathbb{N}^{|P|} \to \mathbb{N}$ are the
marking-dependent multiplicities of the input arc from p to t, the output arc from t to
p, and the inhibitor arc from p to t, respectively. If an arc multiplicity evaluates to zero
in a marking, the arc is ignored (does not have any effect) in that marking.

We say that a transition $t \in T$ is arc-enabled in marking μ iff

$\forall p \in P,\ D^-_{p,t}(\mu) \le \#(p,\mu) \wedge \left( D^\circ_{p,t}(\mu) > \#(p,\mu) \vee D^\circ_{p,t}(\mu) = 0 \right)$

When transition t fires in marking μ, the new marking μ' satisfies:

$\forall p \in P,\ \#(p,\mu') = \#(p,\mu) - D^-_{p,t}(\mu) + D^+_{t,p}(\mu)$

• $\forall t \in T$, $e_t : \mathbb{N}^{|P|} \to \{\text{true}, \text{false}\}$ is the enabling function of
transition t. If $e_t(\mu) = \text{false}$, t is disabled in μ.

• > is a transitive and irreflexive relation imposing a priority among transitions. In a
marking μ, $t_1$ is marking-enabled iff it is arc-enabled, $e_{t_1}(\mu) = \text{true}$, and no other
transition $t_2$ exists such that $t_2 > t_1$, $t_2$ is arc-enabled, and $e_{t_2}(\mu) = \text{true}$. This
definition is more flexible than the one adopted by other SPN formalisms, where integers are
associated with transitions (e.g., imagine the situation where $t_1 > t_2$, $t_3 > t_4$, but $t_1$ has
no priority relation with respect to $t_3$ and $t_4$).

• $\mu_0$ is the initial marking.

• $\forall t \in T$, $\lambda_t : \mathbb{N}^{|P|} \to \mathbb{R}^+ \cup \{\infty\}$ is the rate of the
exponential distribution for the firing time of transition t. If the rate is ∞ in a marking, the
transition firing time is zero. This is a generalization of [1], where transitions are a priori
classified as "timed" or "immediate". In this paper, though, there are transitions for which the
rate is always ∞. We still call them immediate and we represent them with a thin bar instead of a
hollow rectangle. The distinction between vanishing and tangible markings introduced
in [1] is still applicable: a marking μ is said to be vanishing if there is a marking-enabled
transition t in μ such that $\lambda_t(\mu) = \infty$; μ is said to be tangible otherwise. We additionally
impose the interpretation that, in a vanishing marking μ, all transitions t with $\lambda_t(\mu) < \infty$
are implicitly inhibited. Hence, a transition t in a marking μ is enabled in the usual sense
and can actually fire iff it is marking-enabled and either μ is tangible or μ is vanishing
and $\lambda_t(\mu) = \infty$.

• $\forall t \in T$, $w_t : \mathbb{N}^{|P|} \to \mathbb{R}^+$ describes the weight assigned to the firing of
enabled transition t, whenever its rate $\lambda_t$ evaluates to ∞. Assume that the set of transitions
$X \subseteq T$ is enabled in a vanishing marking μ. Then, the probability of firing transition t in μ
is given by $w_t(\mu) / \sum_{y \in X} w_y(\mu)$. If a marking-dependent weight specification is not needed,
the definition of w can be reduced to $\forall t \in T, w_t \in \mathbb{R}^+$.
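The arc-enabling condition, the firing rule, and the weight-based choice among immediate transitions can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: arc multiplicities are constants rather than marking-dependent functions, a marking is a dictionary from place names to token counts, and the function names (`arc_enabled`, `fire`, `firing_probabilities`) are hypothetical, not part of any SRN tool.

```python
from typing import Dict

Marking = Dict[str, int]  # place name -> number of tokens


def arc_enabled(marking: Marking, d_in: Marking, d_inh: Marking) -> bool:
    """Arc-enabling test for one transition (constant multiplicities assumed).

    d_in and d_inh map places to input and inhibitor arc multiplicities;
    a missing entry or a zero means the arc is absent.
    """
    for place, tokens in marking.items():
        if d_in.get(place, 0) > tokens:
            return False  # not enough tokens on an input arc
        inhibitor = d_inh.get(place, 0)
        if inhibitor != 0 and inhibitor <= tokens:
            return False  # inhibitor arc blocks the transition
    return True


def fire(marking: Marking, d_in: Marking, d_out: Marking) -> Marking:
    """New marking: remove input multiplicities, add output multiplicities."""
    return {p: n - d_in.get(p, 0) + d_out.get(p, 0) for p, n in marking.items()}


def firing_probabilities(weights: Dict[str, float]) -> Dict[str, float]:
    """Normalize the weights of the transitions enabled in a vanishing marking."""
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


if __name__ == "__main__":
    m = {"a": 3, "b": 0, "c": 0}
    d_in, d_out, d_inh = {"a": 2}, {"b": 1}, {"c": 1}
    print(arc_enabled(m, d_in, d_inh))
    print(fire(m, d_in, d_out))
    print(firing_probabilities({"t1": 1.0, "t2": 3.0}))
```

Supporting marking-dependent multiplicities would amount to replacing the dictionary lookups with calls to functions of the current marking.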

The SRN components described so far define a trivariate discrete-parameter stochastic
process: $\{(\mu_n, t_n, \theta_n), n \in \mathbb{N}\}$, where $\mu_n$ is the n-th marking encountered, $t_n \in T$ is
the n-th transition to fire (marking $\mu_{n+1}$ is obtained by firing transition $t_n$ in $\mu_n$), and
$\theta_n \ge 0$ is the time at which it fires. It is also possible to define a continuous-time process
describing the marking at time θ, $\{\mu(\theta), \theta \ge 0\}$, which is completely determined given
$\{(\mu_n, t_n, \theta_n), n \in \mathbb{N}\}$: μ(θ) is the marking reached by the last firing that occurs at or
before time θ (the initial marking if no transition has fired yet). This process describes only
the evolution with respect to the tangible markings, that is, Pr{μ(θ) is vanishing} = 0.

The  last  component  of  an  SRN  specification  defines  the  measures  to  be  computed: 

• $M = \{(\rho_1, r_1, \phi_1), \ldots, (\rho_{|M|}, r_{|M|}, \phi_{|M|})\}$ is a finite set of measures, each
specifying the computation of a single real value. A measure $(\rho, r, \phi) \in M$ has three
components. The first and second components specify a reward structure over the underlying
stochastic process $\{(\mu_n, t_n, \theta_n), n \in \mathbb{N}\}$. $\rho : \mathbb{N}^{|P|} \to \mathbb{R}$ is a reward rate: ρ(μ) is
the rate at which reward is accumulated when the marking is μ. $\forall t \in T$,
$r_t : \mathbb{N}^{|P|} \to \mathbb{R}$ is a reward impulse: $r_t(\mu)$ is the instantaneous reward gained when firing
transition t while in marking μ. Often, a marking-dependent reward impulse specification is not
needed and the definition of r can be simplified accordingly. The reward structure specified
by ρ and r over $\{(\mu_n, t_n, \theta_n), n \in \mathbb{N}\}$ defines a new stochastic process $\{Y(\theta), \theta \ge 0\}$,
describing the reward accumulated by the SRN up to time θ:

$Y(\theta) = \int_0^{\theta} \rho(\mu(u))\,du + \sum_{n:\,\theta_n \le \theta} r_{t_n}(\mu_n)$

The third component of a measure specification, φ, is a function that computes a single
real value from the stochastic process $\{Y(\theta), \theta \ge 0\}$. If $\mathcal{K}$ is the set of real-valued
stochastic processes with index over the naturals, then $\phi : \mathcal{K} \to \mathbb{R}$. The generality
of this definition is best illustrated by showing the wide range of measures the triplet
(ρ, r, φ) can capture (in some SRNs, some of these measures might be infinite):

- Expected number of transition firings up to time θ: this is simply $E[Y(\theta)]$ when all
reward rates are zero and all reward impulses are one.

- Expected time-averaged reward up to time θ: $E\left[ Y(\theta)/\theta \right]$.

- Expected instantaneous reward rate at time θ:
$E\left[ \lim_{\delta \to 0} \frac{Y(\theta+\delta) - Y(\theta)}{\delta} \right]$.

- Expected accumulated reward in steady-state: $E\left[ \lim_{\theta \to \infty} Y(\theta) \right]$.

- Mean time to absorption: this is a particular case of the previous measure, obtained
by setting the reward rate of transient and absorbing states to one and zero,
respectively, and all reward impulses to zero.

- Expected instantaneous reward rate in steady-state:
$E\left[ \lim_{\theta \to \infty} \lim_{\delta \to 0} \frac{Y(\theta+\delta) - Y(\theta)}{\delta} \right]$,
which is also the same as the expected time-averaged reward in steady-state:
$E\left[ \lim_{\theta \to \infty} Y(\theta)/\theta \right]$.

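- Supremum reward rate (assume that all reward impulses are zero):
$\sup_{\theta \ge 0} \left\{ v : v \in \mathbb{R} \wedge \Pr\left\{ \lim_{\delta \to 0} \frac{Y(\theta+\delta) - Y(\theta)}{\delta} = v \right\} > 0 \right\}$.

This quantity can be expressed more simply using the stochastic process $\{\mu_n, n \in \mathbb{N}\}$:
$\sup_{n \ge 0} \left\{ \rho(\mu) : \Pr\{\mu_n = \mu\} > 0 \right\}$.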

Our intention is to define parametric SRNs. This can be accomplished by allowing each
component of an SRN to depend on a set of parameters $\nu = (\nu_1, \ldots, \nu_m) \in \mathbb{R}^m$:

$A(\nu) = \{P(\nu), T(\nu), D^-(\nu), D^+(\nu), D^\circ(\nu), e(\nu), >(\nu), \mu_0(\nu), \lambda(\nu), w(\nu), M(\nu)\}$

Once the parameters ν are fixed, a simple (non-parametric) SRN is obtained.

The underlying stochastic process can be solved analytically to compute the probability of
being in each tangible marking μ at time θ, or in steady state. It is also possible
to directly compute the cumulative time spent in each tangible marking μ up to time θ.
All the measures described in this paper are expressed as expectations using
reward rates only, and they can be easily computed as a linear combination of the values of
these probabilities or cumulative times.
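As a small illustration of this "linear combination" step, the sketch below computes an expected reward rate from marking probabilities and an expected accumulated reward from cumulative times. The two-state chain, its rates, and all names are assumptions chosen for the example; they do not come from the models in this paper.

```python
from typing import Dict


def expected_reward_rate(prob: Dict[str, float], reward_rate: Dict[str, float]) -> float:
    """Sum over tangible markings of probability times reward rate."""
    return sum(prob[m] * reward_rate[m] for m in prob)


def expected_accumulated_reward(cum_time: Dict[str, float],
                                reward_rate: Dict[str, float]) -> float:
    """Sum over tangible markings of cumulative sojourn time times reward rate."""
    return sum(cum_time[m] * reward_rate[m] for m in cum_time)


def two_state_steady_state(a: float, b: float) -> Dict[str, float]:
    """Steady-state probabilities of a two-state chain (assumed example):
    'idle' -> 'busy' at rate a, 'busy' -> 'idle' at rate b."""
    return {"idle": b / (a + b), "busy": a / (a + b)}


if __name__ == "__main__":
    pi = two_state_steady_state(1.0, 3.0)
    rho = {"idle": 0.0, "busy": 100.0}  # reward earned only while busy
    print(expected_reward_rate(pi, rho))
```

For a real SRN the probabilities would come from a numerical solution of the underlying Markov chain; only the final weighted sum is shown here.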

3  Analysis  of  a  producer-consumer  tasking  system 

Consider a computer system where data items produced by Np producers are consumed by
Nc consumers. The exchange of items between the Np producer tasks and the Nc consumer
tasks is performed using one additional buffer task. A pseudo-Ada description of this system
appears in Figure 1 [7].

The buffer task stores the incoming items into array Slots, having Ns positions, and it uses
the integer variable FullSlots to keep track of the number of non-empty slots. Producer tasks
cannot pass items to the buffer task when FullSlots is equal to Ns, and consumer tasks cannot
retrieve items from the buffer task when FullSlots is equal to 0. The number of produced items
cannot then exceed the number of consumed items plus Ns. A larger value for Ns can only
attenuate the effect of temporary increases in the production or consumption rates, but these
are equal in the long run.

The mechanism by which two Ada tasks synchronize and exchange data is the "rendezvous".
Whenever a producer task has an item ready to pass, it issues an "entry call" to the buffer
task (line 16). If the buffer task accepts this entry call, the rendezvous takes place, FullSlots is
incremented, and the item is copied into array Slots (lines 35-37); similarly, a rendezvous with
a consumer (lines 41-43) decrements FullSlots by one.

Each "entry" (lines 05 and 06) has an associated queue, where tasks making an entry call
wait for a rendezvous. The presence of "guards" EnablePut and EnableGet (lines 34 and 40)
inhibits the rendezvous at the guarded entry if the boolean condition is false (the guard is
"closed"). Table I describes the value assumed by these boolean predicates based on three
factors (presence of tasks in each of the two queues and value of variable FullSlots), for the five
different policies discussed. When the buffer task is ready to rendezvous, the following cases
can arise:

• Only one guard is open, but its associated queue is empty. This happens only when all
the slots are full and no consumer is waiting, or all the slots are empty and no producer
is waiting. A rendezvous cannot take place right away; the buffer task waits until a task
joins the queue with the open guard.

• Both guards are open, but their associated queues are empty. A rendezvous cannot take
place; the buffer task waits for the first task to join any of the two queues.

• Only one guard is open and its associated queue contains at least one task. A rendezvous
with the first task in that queue takes place immediately.

• Both guards are open and their associated queues both contain at least one task. A
rendezvous with either the first producer or the first consumer in their respective queues
takes place immediately.

Only the fourth case requires a choice between a rendezvous with a producer and a
rendezvous with a consumer. In its definition, Ada makes no guarantee about which queue is
actually selected. If a particular selection policy is desired, it can be enforced by modifying
the guard predicates so that, when no queue is empty, exactly one guard is open.

01  Np : constant := number of producers;
02  Nc : constant := number of consumers;
03  Ns : constant := number of buffer slots;
04  task Buffer is
05     entry Put( Item : in data );
06     entry Get( Item : out data );
07  end Buffer;
08  task type Producer;
09  task type Consumer;
10  Producers : array ( 1 .. Np ) of Producer;
11  Consumers : array ( 1 .. Nc ) of Consumer;
12  task body Producer is
13     Item : data;
14  begin
15     loop
16        Buffer.Put( Item );
17        statements Sp;
18     end loop;
19  end Producer;
20  task body Consumer is
21     Item : data;
22  begin
23     loop
24        Buffer.Get( Item );
25        statements Sc;
26     end loop;
27  end Consumer;
28  task body Buffer is
29     Slots : array ( 1 .. Ns ) of data;
30     FullSlots : Natural := 0;
31  begin
32     loop
33        select
34           when EnablePut =>
35              accept Put( Item : in data ) do
36                 FullSlots := FullSlots + 1;
37                 Slots( FullSlots ) := Item;
38              end Put;
39        or
40           when EnableGet =>
41              accept Get( Item : out data ) do
42                 Item := Slots( FullSlots );
43                 FullSlots := FullSlots - 1;
44              end Get;
45        end select;
46        statements Sb;
47     end loop;
48  end Buffer;

Figure 1: Pseudo-Ada description of the producer-consumer system.

In  Table  I,  five  different  policies  are  presented: 

•  Nondeterministic  (ND):  nondeterministically  select  either  a  producer  or  a  consumer,  with 
uniform  probability.  This  can  be  accomplished,  for  example,  using  a  pseudo-random 
number  generator.  It  is  also  possible  to  remember  the  selection  made  the  previous  time 
in  this  situation  and  toggle  the  selection;  this  is  likely  to  be  faster,  but  it  introduces  a 
correlation  in  the  sequence  of  selections. 

•  Producer  First  (PF):  select  a  producer. 

•  Consumer  First  (CF):  select  a  consumer. 

• Proportional (PR): nondeterministically select either a producer or a consumer, but,
instead of using uniform probability for producers and consumers, use a probability split
proportional to the number of empty and full slots, respectively. This bias tends to keep
the number of empty and full slots more balanced, which is intuitively a good idea.

• Threshold (TH): choose a producer if more slots are empty than full; choose a consumer
if more slots are full than empty; choose either with uniform probability if exactly half
of the slots are full. This policy tries to achieve the same goal as the previous one, but
deterministically. When exactly half of the slots are full, the behavior is the same as in
the ND policy; for simplicity, we assume Ns to be odd, so this case cannot arise.
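The five policies can be sketched as a single guard function that returns the pair (EnablePut, EnableGet) for the current state of the buffer task, following the rows of Table I. This is a Python illustration rather than the Ada code of the system; the function name `guards`, its parameters, and the use of Python's `random` module for the random choices are assumptions made for the example.

```python
import random


def guards(policy, producers_waiting, consumers_waiting, full_slots, n_slots, rng=random):
    """Return (enable_put, enable_get) for one pass through the select statement.

    policy is one of "ND", "PF", "CF", "PR", "TH"; rng only needs a random() method.
    """
    if full_slots == 0:
        return True, False          # nothing to get
    if full_slots == n_slots:
        return False, True          # no room to put
    if not (producers_waiting and consumers_waiting):
        return True, True           # at most one queue is non-empty: no choice needed
    empty_slots = n_slots - full_slots
    if policy == "ND":
        put = rng.random() < 0.5
    elif policy == "PF":
        put = True
    elif policy == "CF":
        put = False
    elif policy == "PR":
        put = rng.random() < empty_slots / n_slots
    elif policy == "TH":
        if empty_slots != full_slots:
            put = empty_slots > full_slots
        else:
            put = rng.random() < 0.5  # exactly half full: behave like ND
    else:
        raise ValueError("unknown policy: " + str(policy))
    return put, not put


if __name__ == "__main__":
    for name in ("ND", "PF", "CF", "PR", "TH"):
        print(name, guards(name, True, True, 2, 5))
```

When both queues are non-empty, exactly one guard is open, so the choice left unspecified by Ada never arises.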

Table I
Values for Boolean Predicates EnablePut and EnableGet

                      Condition                        Value returned
Policy  Producers  Consumers  Value of             EnablePut  EnableGet
        waiting    waiting    FullSlots
------  ---------  ---------  -------------------  ---------  ---------
X       X          X          0                    true       false
X       X          X          Ns                   false      true
X       no         no         1..Ns-1              true       true
X       yes        no         1..Ns-1              true       X
X       no         yes        1..Ns-1              X          true
ND      yes        yes        1..Ns-1              α          not α
PF      yes        yes        1..Ns-1              true       false
CF      yes        yes        1..Ns-1              false      true
PR      yes        yes        1..Ns-1              β          not β
TH      yes        yes        1..⌊(Ns-1)/2⌋        true       false
TH      yes        yes        ⌈(Ns+1)/2⌉..Ns-1     false      true
TH      yes        yes        Ns/2 (Ns even)       α          not α

Note: X means that the value is not relevant.
α is a boolean random variable with Pr{α = true} = 1/2.
β is a boolean random variable with Pr{β = true} = (Ns - FullSlots)/Ns.


So far, the description of the system has been focused on the software, but the actual
timing behavior is determined also by the hardware architecture and by the allocation of tasks
to processors. Three possibilities are considered:

• SINGLE: a classic single-processor architecture, where all tasks share the same CPU.

• THREE: a three-processor architecture, one processor for the Np producer tasks, another
for the Nc consumer tasks, and the last one for the single buffer task.

• MANY: a one-processor-per-task architecture, with no processor sharing.

The number of processors in the actual system could probably be somewhere between
3 and Np + Nc + 1; the single-processor architecture is considered mainly for reference.

Figure 2: The SRN for the producer-consumer system.

3.1  SRN  model 

The system just described is concisely modeled by the SRN in Figure 2. Tokens in places
Plocal, Clocal, and Blocal represent tasks (of the appropriate type) executing the statements
Sp, Sc, and Sb, respectively, while tokens in places Pwait, Cwait, and Bwait represent tasks
waiting for a rendezvous at the Put or Get entries. The tokens in places Empty and Full
count the number of empty and full slots, respectively.

Transitions Sp, Sc, and Sb are assumed to be "black boxes" with an exponentially
distributed time duration, but they could be changed into a more detailed phase-type expansion
(using a "subnet") if more information were available about the actual structure of the code.
This would increase the size of the reachability graph, but it would also allow a more precise
representation of the timing behavior in case exponential distributions were not adequate.
Immediate transitions Put and Get correspond to the actions in the rendezvous (lines 34-38
and 40-44 in Figure 1, respectively), which are modeled as instantaneous, since the time spent
for them is likely to be negligible compared to the other blocks of statements (if not, the SRN
could be modified to represent these time durations explicitly).

The five different selection policies described in the previous section are obtained by
defining the appropriate value for predicates EnablePut and EnableGet. Exactly analogous to
them, and even simpler to specify, are the "enabling functions" $e_{Put}$ and $e_{Get}$ associated with
transitions Put and Get, respectively:

$e_{Put} = \begin{cases} \text{false} & \text{if (policy = CF and } \#(Cwait) > 0 \text{ and } \#(Full) > 0) \\ & \text{or (policy = TH and } \#(Cwait) > 0 \text{ and } \#(Empty) < \#(Full)) \\ \text{true} & \text{otherwise} \end{cases}$

$e_{Get} = \begin{cases} \text{false} & \text{if (policy = PF and enabled(Put))} \\ & \text{or (policy = TH and } \#(Pwait) > 0 \text{ and } \#(Empty) > \#(Full)) \\ \text{true} & \text{otherwise} \end{cases}$

In addition, the probabilistic choices in the ND and PR policies (and TH, when Ns is even)
can be specified by assigning weights $w_{Put}$ and $w_{Get}$ to the two transitions:

$w_{Put} = \begin{cases} \#(Empty) & \text{if policy = PR} \\ 1 & \text{otherwise} \end{cases}$

$w_{Get} = \begin{cases} \#(Full) & \text{if policy = PR} \\ 1 & \text{otherwise} \end{cases}$

The specification of the rates for the remaining three transitions completes the description
of the SRN. These rates are related to the times required to execute the blocks of statements
Sp, Sc, and Sb, but also to the type of hardware architecture, since sharing the processor slows
down the execution. Table II shows the firing rates used assuming perfect processor sharing
with no context switch overhead, and assuming that the times required to execute blocks Sp,
Sc, and Sb for a task running on a processor in isolation (no sharing) are 0.003, 0.003, and
0.0005 seconds, respectively.

Table II
Rates for the Transitions of the Producer-Consumer SRN

Transition  Architecture  Firing rate (sec^-1)
----------  ------------  --------------------------------------------------------
λ_Sp        SINGLE        #(Plocal) / (0.003 (#(Plocal) + #(Clocal) + #(Blocal)))
            THREE         1/0.003
            MANY          #(Plocal)/0.003
λ_Sc        SINGLE        #(Clocal) / (0.003 (#(Plocal) + #(Clocal) + #(Blocal)))
            THREE         1/0.003
            MANY          #(Clocal)/0.003
λ_Sb        SINGLE        1 / (0.0005 (#(Plocal) + #(Clocal) + 1))
            THREE         1/0.0005
            MANY          1/0.0005
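The entries of Table II can be written as one marking-dependent rate function per architecture. The Python sketch below is an illustration only: the function name `firing_rates` and the convention of returning a rate of 0 for a transition whose input place is empty (that is, a disabled transition) are assumptions added for the example.

```python
T_SP, T_SC, T_SB = 0.003, 0.003, 0.0005  # isolated execution times (seconds)


def firing_rates(arch, plocal, clocal, blocal):
    """Return the rates of transitions Sp, Sc, Sb for the given token counts.

    arch is "SINGLE", "THREE", or "MANY"; blocal is 0 or 1 (single buffer task).
    """
    if arch == "SINGLE":
        sharing = plocal + clocal + blocal  # tasks sharing the only CPU
        sp = plocal / (T_SP * sharing) if plocal else 0.0
        sc = clocal / (T_SC * sharing) if clocal else 0.0
        sb = 1.0 / (T_SB * (plocal + clocal + 1)) if blocal else 0.0
    elif arch == "THREE":
        # one processor per task class, shared equally within the class
        sp = 1.0 / T_SP if plocal else 0.0
        sc = 1.0 / T_SC if clocal else 0.0
        sb = 1.0 / T_SB if blocal else 0.0
    elif arch == "MANY":
        # one processor per task: rates add up
        sp = plocal / T_SP
        sc = clocal / T_SC
        sb = 1.0 / T_SB if blocal else 0.0
    else:
        raise ValueError("unknown architecture: " + str(arch))
    return {"Sp": sp, "Sc": sc, "Sb": sb}


if __name__ == "__main__":
    for arch in ("SINGLE", "THREE", "MANY"):
        print(arch, firing_rates(arch, plocal=2, clocal=1, blocal=1))
```

With two producers, one consumer, and the buffer task all running locally, the SINGLE processor is split four ways, which is why every SINGLE rate is divided by the number of active tasks.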

Before concluding this section, it is useful to compute the number of markings generated by
the SRN analysis, both to check the correctness of the model and to avoid attempting solutions
that require excessive resources (memory in particular). For the parametric SRN of Figure 2,
this number is a function of the parameters Np, Nc, and Ns (the policy and the architecture
are also parameters, but they do not affect the number of markings). Table III shows how to
count the exact number of vanishing and tangible markings using a case analysis. Since the
number of markings grows as O(Ns Np Nc), it is possible to study the system for reasonably
large values of these three parameters (smaller SRNs can normally be analyzed in a few minutes
on a workstation, while considerably larger ones can be solved in a matter of hours, if enough
memory is available).

Table III
Marking Count for the Producer-Consumer SRN

        Contents of place                     Markings
Empty     Pwait   Cwait   Bwait     Type        Count
--------  ------  ------  -----     ---------   ------------------------
0..Ns     0..Np   0..Nc   0         tangible    (Ns + 1)(Np + 1)(Nc + 1)
0..Ns     0       0       1         tangible    Ns + 1
0..Ns-1   0       1..Nc   1         vanishing   Ns Nc
1..Ns     1..Np   0       1         vanishing   Ns Np
0..Ns     1..Np   1..Nc   1         vanishing   (Ns + 1) Np Nc
Ns        0       1..Nc   1         tangible    Nc
0         1..Np   0       1         tangible    Np

tangible markings:   (Ns + 1)((Np + 1)(Nc + 1) + 1) + Np + Nc
vanishing markings:  (Ns + 1) Np Nc + Ns (Np + Nc)
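The case analysis of Table III can be cross-checked by brute-force enumeration. The Python sketch below assumes that a marking is identified by the contents of Empty, Pwait, Cwait, and Bwait, that every marking with Bwait = 0 is tangible, and that a marking with Bwait = 1 is vanishing exactly when Put or Get can fire; the function names are hypothetical.

```python
def count_by_enumeration(n_p, n_c, n_s):
    """Count (tangible, vanishing) markings by listing every combination."""
    tangible = vanishing = 0
    for empty in range(n_s + 1):
        full = n_s - empty
        for pwait in range(n_p + 1):
            for cwait in range(n_c + 1):
                tangible += 1  # Bwait = 0: buffer task is executing Sb
                put_possible = pwait > 0 and empty > 0
                get_possible = cwait > 0 and full > 0
                if put_possible or get_possible:
                    vanishing += 1  # Bwait = 1 and an immediate transition can fire
                else:
                    tangible += 1   # Bwait = 1 but the buffer task must wait
    return tangible, vanishing


def count_by_formula(n_p, n_c, n_s):
    """Closed-form totals from Table III."""
    tangible = (n_s + 1) * ((n_p + 1) * (n_c + 1) + 1) + n_p + n_c
    vanishing = (n_s + 1) * n_p * n_c + n_s * (n_p + n_c)
    return tangible, vanishing


if __name__ == "__main__":
    for params in [(1, 1, 1), (2, 2, 3), (4, 2, 5)]:
        print(params, count_by_enumeration(*params), count_by_formula(*params))
```

Agreement between the two counts for a few small parameter values is a quick sanity check before generating a large reachability graph.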

3.2  Performance  analysis 

The five policies defined earlier have a simple implementation in Ada. Even the ones requiring
a pseudo-random number generator introduce only a small overhead compared to the number
of statements likely to constitute the blocks Sp, Sc, and Sb.

The selection of a policy among the five presented could then be based on the effect
that these policies have on the performance of the system (in steady-state). Different aspects
of the system behavior might be the most relevant in defining "performance":

• Response time for producers, consumers, or both.

• Probability distribution of the number of producers blocked because all slots are full.

• Probability distribution of the number of consumers blocked because all slots are empty.

• Throughput of the system, expressed as the number of items passed from any producer
to any consumer in a unit of time.

For the purpose of this study, the throughput of the system, τ, is used. In the SRN of
Figure 2, the throughput of the producers, $\tau_p$, can be computed by defining the reward rate
in marking μ as the rate of transition Sp, $\lambda_{Sp}(\mu)$, and computing the expected reward rate in
steady-state:

$\tau_p = E\left[ \lim_{\theta \to \infty} \lambda_{Sp}(\mu(\theta)) \right]$

$\tau_c$ and $\tau_b$ can be computed in a similar way, or by observing that $\tau = \tau_p = \tau_c = \tau_b / 2$.

The value of τ with the SINGLE architecture is the same independent of the policy adopted
and of the number of slots, Ns, producers, Np, or consumers, Nc (as long as none is zero):
τ = 142.857 sec$^{-1}$. The reason is that the only processor is always busy, so τ is simply the
inverse of the total time spent to process each item: 0.003 seconds in the producer task, 0.0005
seconds in the buffer task after the rendezvous with a producer, plus another 0.0005 seconds
after the rendezvous with a consumer, and finally 0.003 seconds in the consumer task, for a
total of 0.007 seconds (1/0.007 = 142.857).

With the THREE and MANY architectures, τ is instead affected by the three parameters.
Figures 3 and 4 plot τ as a function of Ns for these two architectures using the ND policy in a
balanced system (Np = Nc). The effect of different policies is minor compared to doubling Np
and Nc, so it is studied later in this section.

With the THREE architecture, the improvement due to the increase in Ns is sublinear.
In addition, it is less noticeable for larger values of Np = Nc, since, after a certain point,
the processors for the producers and the consumers become the bottleneck. In this case, the
limit for τ is the inverse of the maximum of the time spent on each item by each processor
(0.003, 0.001, and 0.003 sec, respectively):
$\lim_{N_p, N_c, N_s \to \infty} \tau = 1/0.003\ \mathrm{sec}^{-1} = 333.333\ \mathrm{sec}^{-1}$.
For example, τ = 329.340 sec$^{-1}$ when Np = Nc = 32 and Ns = 19 (not shown).

The improvement due to increasing Ns with the MANY architecture is even smaller.
Furthermore, it appears that the system is saturated when Np = Nc = 8 and no appreciable
improvement is achieved by increasing Ns. The reason is again to be found in the bottleneck,
this time the buffer task. The production (and consumption) rate could now be as high as
Np/0.003 sec$^{-1}$, since each task has a dedicated processor, but the buffer task is involved in two
rendezvous for each item, so that an upper bound on τ is given by 1/0.001 = 1000 sec$^{-1}$. Since
4/0.003 > 1000, it appears that the buffer task is already the bottleneck when Np = Nc = 4,
although, in this case, increasing the number of slots still has a visible effect on τ (but the
increase is smaller than when Np = Nc = 1, 2). To summarize this first part of the analysis:

•  It  13  advantageous  to  increase  the  number  of  producers  and  consumers,  even  if  the  total 
computational  capacity  remains  constant  (architecture  THREE).  Depending  on  the  na¬ 
ture  of  the  system,  though,  this  may  not  be  possible,  since  the  number  of  tasks  could  be 
dictated  by  external  considerations  (e.g.,  each  producer  task  monitors  a  different  sensor). 

• Increasing the number of slots is always advantageous, but particularly so when only a
few producer and consumer tasks are present.

• On a highly parallel architecture (MANY), the buffer task soon becomes a bottleneck.
This points out a limitation of the Ada rendezvous. The buffer task must be introduced
because, in Ada, a task performing an entry call must know the identity of the callee,
so it is not possible to let a producer rendezvous directly with any consumer using a
single entry call. This problem can be alleviated by having several buffer tasks and
partitioning the producer and consumer tasks so that each buffer task serves only a
subset of the producers and consumers. Partitioning, though, introduces a different kind
of inefficiency: producers associated with a buffer task whose slots are all full sit idle, even
if other buffer tasks have some or even all of their slots empty.
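
The bottleneck bounds quoted in the discussion of the THREE and MANY architectures can be reproduced with a short calculation. The Python sketch below is illustrative only: the function name and parameters are hypothetical, and it assumes that the per-item times 0.003, 0.001, and 0.003 sec belong to the producer, buffer, and consumer work, respectively, and that the consumer side of MANY is bounded symmetrically to the producer side.

```python
def throughput_bound(n_producers, n_consumers, architecture,
                     t_produce=0.003, t_buffer=0.001, t_consume=0.003):
    """Upper bound on the throughput (items per second).

    Assumption: the per-item times are those quoted in the text, read as
    producer, buffer, and consumer time, in that order.
    """
    if architecture == "THREE":
        # One processor per task class: the slowest class limits the rate.
        return 1.0 / max(t_produce, t_buffer, t_consume)
    if architecture == "MANY":
        # One processor per task: producers and consumers scale with their
        # number, but the single buffer task does not.
        return min(n_producers / t_produce,
                   1.0 / t_buffer,
                   n_consumers / t_consume)
    raise ValueError("unknown architecture: " + architecture)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, throughput_bound(n, n, "THREE"), throughput_bound(n, n, "MANY"))
```

Under these assumptions the MANY bound stops growing once Np/0.003 exceeds 1/0.001, that is, from Np = Nc = 4 onward, which matches the saturation described above.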

Considering now the effect of the selection policies, it is immediately apparent that there
is no absolute “optimal” policy. Figure 5 shows τ as a function of Ns in an unbalanced system
(Np = 4, Nc = 2), with the MANY architecture, for the five policies. The ability of producers
to produce is higher than the ability of consumers to consume, hence giving precedence to
consumers (CF policy) tends to restore the balance and is the optimal choice, while the PF
policy increases the imbalance, resulting in the worst throughput. The plot for Np = 2, Nc = 4
(not shown) is exactly the same as the one for Np = 4, Nc = 2, with the exception that the
labels for the PF and CF policies are reversed: the PF policy is now optimal while the CF
policy is the worst, and the others achieve the same value (the ND, PR, and TH policies are
“symmetric”, so it should not be surprising that they result in the same throughput when
Np = 4, Nc = 2 and when Np = 2, Nc = 4).

Figure 3: τ (in sec^-1) as a function of Ns (THREE architecture, ND policy).

Figure 4: τ (in sec^-1) as a function of Ns (MANY architecture, ND policy).

Figure 5: τ (in sec^-1) as a function of Ns (MANY architecture, Np = 4, Nc = 2).

Figure 6: τ (in sec^-1) as a function of Ns (MANY architecture, Np = 4, Nc = 4).

The TH policy is the second best in either case, followed by the PR and ND policies, in that
order. This can be justified by observing that the PR policy is a (not so useful) compromise
between the TH policy, which deterministically tries to achieve balance in the number of used
and free slots, and the ND policy, which completely ignores the status of the slots.

Figure 7: τ (in sec^-1) as a function of Ns (MANY architecture, Np = 8, Nc = 4).

The same reasoning explains the effect of the five policies in a balanced system where
Np = Nc = 4 (Figure 6). The two asymmetrical policies, PF and CF, are equally poor, while
the TH, PR, and ND policies are at the top, in that order.

The  TH  policy  is  a  consistently  good  choice  (nearly  optimal  in  an  unbalanced  system, 
optimal  in  a  balanced  system).  If  the  number  of  producers  and  consumers  is  subject  to  change 
during  the  deployment  of  the  system,  the  TH  policy  is  the  best  choice  because  of  its  easy 
implementation  and  predictable  performance.  For  example,  a  system  initially  unbalanced  in 
favor  of  consumers  could  suggest  the  adoption  of  the  PF  policy,  but  inefficiencies  would  arise 
if  the  situation  had  to  be  reversed  later  (the  difference  between  the  best  and  worst  policy  in 
Figure  5  is  over  5%). 

The plots for the PF and CF policies in Figure 6 appear to have a much slower rate of
increase than the plots for the other three policies. This can be explained by considering
what happens with the PF policy when the system is not unbalanced toward the consumers
(Np ≥ Nc) and the buffer task is the bottleneck (Nc/0.003 > 1/0.001). In this case, there
are often both producers and consumers waiting whenever the buffer is ready to rendezvous.
The PF policy, though, chooses a producer whenever possible, that is, whenever there is at
least one empty slot. The effect of this behavior, in the limit, is to let the number of full slots
alternate between Ns (choose a consumer) and Ns − 1 (choose a producer), with a negative
effect: the system might as well have just a single slot (Ns = 1), since the “window” between
the number of produced and consumed items is effectively restricted to one most of the time.
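
The limiting behavior of the PF policy can be illustrated with a deterministic toy trace. The Python sketch below is not the SRN model of the paper: it assumes, as a simplification, that a producer and a consumer are both waiting at every rendezvous, so the choice depends only on the number of full slots. The function name is hypothetical.

```python
def pf_full_slot_trace(n_slots, steps, start=0):
    """Number of full slots after each rendezvous under the PF policy.

    Simplifying assumption: a producer and a consumer are always waiting,
    so PF picks a producer whenever at least one slot is empty.
    """
    full = start
    trace = []
    for _ in range(steps):
        if full < n_slots:
            full += 1   # an empty slot exists: serve a producer
        else:
            full -= 1   # buffer is full: only a consumer can be served
        trace.append(full)
    return trace


if __name__ == "__main__":
    print(pf_full_slot_trace(n_slots=4, steps=10))
```

After the initial fill, the count moves only between Ns − 1 and Ns, so the remaining slots never contribute any slack, which is the single-slot effect described above.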

Figure 7 is even more dramatic, showing no appreciable increase at all for the PF policy.
The values of τ for a system with the MANY architecture, Np = 8, Nc = 4, and Ns = 125 (not
shown) confirm this observation. The difference between the optimal CF policy and the TH
policy is minimal (994.3 sec^-1 and 994.1 sec^-1, respectively), while the PF policy lags seriously
behind (891.4 sec^-1). Even more illuminating is the inspection of the probabilities Π_e that all
slots are empty and Π_f that all slots are full:

\[
\Pi_e = \sum_{\mu:\,\#(\mathit{Empty},\mu)=N_s} \pi_\mu =
\begin{cases}
< 10^{-12} & \text{(ND, PF, PR, and TH policies)} \\
0.13929526 & \text{(CF policy)}
\end{cases}
\]

\[
\Pi_f = \sum_{\mu:\,\text{all slots full}} \pi_\mu =
\begin{cases}
0.275976 & \text{(ND policy)} \\
0.553947 & \text{(PF policy)} \\
0.000003 & \text{(CF policy)} \\
0.010870 & \text{(PR policy)} \\
0.000499 & \text{(TH policy)}
\end{cases}
\]

These probabilities should be kept small, since, when all slots are full (empty), no rendezvous
can take place with a producer (consumer), thus increasing the probability that the buffer
task, which is the real bottleneck, remains idle. While Π_e is numerically negligible only for the
non-optimal policies, Π_f is the most relevant quantity to observe, since Np > Nc; Π_f is small
for both the CF and TH policies (about 170 times smaller for CF than for TH, although this
difference affects the actual throughput τ only marginally), but it is definitely high for the PF
policy, resulting in a considerably smaller value of τ.

The average number ν_f of full slots also confirms this behavior:

\[
\nu_f =
\begin{cases}
123.22 & \text{(ND policy)} \\
124.55 & \text{(PF policy)} \\
9.66 & \text{(CF policy)} \\
100.72 & \text{(PR policy)} \\
71.40 & \text{(TH policy)}
\end{cases}
\]

The TH policy does indeed achieve the value of ν_f closest to Ns/2 = 62.5, but it is still
sub-optimal, suggesting that the ideal value for ν_f is actually a function of Np and Nc as well.

4  Fault-tolerant  software 

Design diversity as a means of achieving fault-tolerance in software has been suggested by
several authors. Fault-tolerant software techniques using this method include N-version
programming [4] and recovery blocks [22]. The former uses voting on the results of various
versions for error detection and the latter uses an acceptance test (AT) and rollback recovery.
While these are the two major approaches to software fault-tolerance, several hybrid methods
have also been proposed [22, 24].

In this section, we analyze the recovery block (RB) scheme, a fault-tolerant software
construct that uses design diversity [22]. It consists of a primary module, one or more alternate
modules, and an AT. The primary and the alternate modules are based on different algorithms
for the same problem and may be implemented by different programmers. On a given set of
data inputs, the primary is executed first and the results are checked using the AT. Should
the AT fail to accept the results, the alternate modules are invoked in succession until one is
found to produce results that are accepted by the test or until all of them fail to satisfy the
AT. In the latter case, the RB is said to have failed on this input data set. The pseudocode
for a RB with a primary module and m alternate modules is shown below:

    ensure acceptance test
        by primary module
        else by alternate module 1
        else by alternate module 2
        ...
        else by alternate module m
        else error
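
The control flow of the pseudocode can be sketched in Python as follows. This is a minimal illustration, not the construct of [22]: the names `recovery_block`, `buggy_sort`, and `accepts_sorted` are hypothetical, and rollback is modeled simply by passing every module the same unmodified input rather than by restoring saved state.

```python
def recovery_block(modules, acceptance_test, data):
    """Try the primary module first, then each alternate in order.

    modules: list of callables, primary first (hypothetical interface).
    acceptance_test: callable (data, result) -> bool.
    Raises RuntimeError if no module produces an accepted result.
    """
    for module in modules:
        try:
            result = module(data)      # every module sees the original input
        except Exception:
            continue                   # a crash counts as a rejected result
        if acceptance_test(data, result):
            return result
    raise RuntimeError("recovery block failed: no result passed the acceptance test")


def buggy_sort(values):
    """Faulty primary for the demonstration: returns the input unsorted."""
    return list(values)


def accepts_sorted(values, result):
    """Acceptance test: same length and non-decreasing order."""
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return len(result) == len(values) and ordered


if __name__ == "__main__":
    print(recovery_block([buggy_sort, sorted], accepts_sorted, [3, 1, 2]))
```

In the demonstration the primary's output is rejected by the acceptance test, so the alternate (`sorted`) is invoked and its result is accepted.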

Probabilistic models of RBs have been considered by several authors [3, 9, 25]. Discrete
time Markov chains (DTMCs) have been used to derive measures like the probability of RB
failure or the number of inputs (correctly) processed until RB failure; continuous time Markov
chains (CTMCs) have been used to derive time-based measures like the mean time to failure
(MTTF) or the (un)reliability of the RB. We note that if we are able to analyze a CTMC for
transient cumulative measures in addition to transient instantaneous measures, we can derive
both of the above types of measures using a CTMC, i.e., we do not need to resort to two different
formalisms depending on the measures desired.

Pucci [21] points out some of the difficulties in estimating the parameters used in earlier
models. He classifies events occurring in a RB into four distinct categories based on the
behavior of the alternate modules and the AT. Four different events can occur:

(1)  Module  i  produces  correct  results  which  the  AT  accepts. 

(2)  Module  i  produces  correct  results  which  the  AT  rejects. 

(3)  Module  i  produces  incorrect  results  which  the  AT  rejects.

(4)  Module  i  produces  incorrect  results  which  the  AT  accepts. 

It is easier to estimate parameters corresponding to these events. We consider a similar event
classification in the model presented in the next section.

4.1  SRN  model


In this section, we present a general SRN model for the recovery block scheme. The primary
module is indexed by 0 and the alternate modules are indexed 1 through m. The execution
time of module i is assumed to be exponentially distributed with mean 1/λ_{Tm_i} and that of the
AT is exponentially distributed with mean 1/λ_{Tatne_i} = 1/λ_{Tate_i}. The probability that module
i produces incorrect output is p_i. We assume that the AT fails to detect erroneous module
output with probability p_e. This probability corresponds to event (4) mentioned above. We
assume that this event is not catastrophic, unlike the assumption used by Pucci [21]. However,
it is easy to change our model to make this event catastrophic. The AT might raise a false
alarm with probability p_f, which corresponds to event (2) above. We assume that this event
does not result in subsequent rejection of results from the other alternate modules, unlike
the assumption made by Pucci [21]. This assumption can easily be changed in the model. We let
p_c be the probability that recovery following a failure to satisfy the AT is successful. We must
realize that all the above probabilities pertaining to any module i are conditional probabilities,
conditioned upon the fact that module i is actually invoked and that the modules preceding it
have failed. Thus, the correlation between the software modules is automatically accounted
for by the conditional nature of these probabilities.

The SRN model of a recovery block is shown in Figure 8. The net is nearly self-explanatory.
Place Pm_0 is the starting point of the RB. The firing of transition Tm_0 corresponds to the
completion of the execution of the primary module. Transitions Tne_0 and Te_0 correspond to
the events that the results produced by the module are correct and incorrect, respectively, and
have weights 1 − p_i and p_i, respectively (with i = 0 for the primary). Transition Tatne_0
represents the execution of the AT after the module produces correct results. The immediate
transitions Ts_0 and Tfa_0, which correspond to events (1) and (2) mentioned above, are then
enabled. The weights of these two transitions are 1 − p_f and p_f, respectively. Transition Tate_0
represents the execution of the AT after the module produces incorrect results. The immediate
transitions Tse_0 and Tee_0, which correspond to events (3) and (4) mentioned above, are then
enabled. The weights of these two transitions are 1 − p_e and p_e, respectively. Once an error is
discovered, represented by the firing of either Tfa_0 or Tse_0, the system initiates a recovery
action. Transition Tsr_0 represents a successful recovery after a failure of the AT, and transition
Tfl_0 represents an unsuccessful recovery, which results in RB failure. The corresponding
weights of these two transitions are p_c and 1 − p_c, respectively. The output arc from Tsr_0 leads
to Pm_1, the starting place of the first alternate module, while the output arc from Tfl_0 leads to
Pfail, which represents RB failure.

Figure 8: SRN model for a recovery block.
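
The branching weights just described combine into the probabilities of the four events for a single execution of module i. The restatement below is a sketch derived from the weights given in the text, not a formula quoted from the paper; the labels Pr{(1)} through Pr{(4)} are shorthand introduced here for the four events.

\[
\begin{aligned}
\Pr\{(1)\} &= (1 - p_i)(1 - p_f), \\
\Pr\{(2)\} &= (1 - p_i)\,p_f, \\
\Pr\{(3)\} &= p_i\,(1 - p_e), \\
\Pr\{(4)\} &= p_i\,p_e .
\end{aligned}
\]

The four terms sum to 1. The probability that the AT rejects the output of module i, and hence that a recovery action is attempted, is Pr{(2)} + Pr{(3)} = (1 − p_i) p_f + p_i (1 − p_e).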

The alternate modules are similarly modeled by the other places and transitions indexed
from 1 to m. The structure of the last module is slightly different, since the failure of the last
module automatically results in a system failure. Thus, the output arcs from transitions Tfa_m
and Tse_m lead to place Pfail.

We can compute the mean time to recovery block failure or the distribution of the time to
recovery block failure (its complement is the reliability of the recovery block). For this purpose,
we assign reward rate 1 to all markings in which there is no token in place Pfail; all other
markings are assigned a reward rate equal to zero. If we now compute the expected accumulated
reward until system failure, we obtain the MTTF:

\[
MTTF = \sum_{\mu} r_\mu \int_0^{\infty} \pi_\mu(\tau)\, d\tau
\]

where r_μ is the reward rate of marking μ and π_μ(τ) is the probability of marking μ at time τ.
By computing the expected reward rate at time θ, we obtain instead the RB reliability at time θ:

\[
R(\theta) = \sum_{\mu} r_\mu\, \pi_\mu(\theta)
\]

The unreliability UR(θ), or probability of having failed by time θ, is simply given by 1 − R(θ).

We can also compute the number of data sets processed until system failure, N,
or until a specified time θ, N(θ). To compute these quantities, we assign the rate of transition
Tm_0, λ_{Tm_0}(μ), as the reward rate of marking μ. The expected accumulated reward until system
failure or by time θ yields N and N(θ), respectively:

\[
N = \sum_{\mu} \lambda_{Tm_0}(\mu) \int_0^{\infty} \pi_\mu(\tau)\, d\tau
\]

\[
N(\theta) = \sum_{\mu} \lambda_{Tm_0}(\mu) \int_0^{\theta} \pi_\mu(\tau)\, d\tau
\]

The MTTF and N, the expected number of inputs processed until system failure, as a
function of the number of modules available in the system (including the primary module) are
shown in Figure 9. We assume that the execution rate of the primary module is λ_{Tm_0} = 1
min^-1. For each alternate module, we assume that the execution rate is three-quarters that
of the previous module, i.e., λ_{Tm_i} = 0.75 λ_{Tm_{i−1}}. This is a reasonable assumption, since we
would tend to use the fastest module as the primary module. The execution rate of the AT
is λ_{Tate_i} = λ_{Tatne_i} = 100 min^-1. The probabilities are p_i = 0.1 for all i, p_e = 0.01, p_f = 0.01,
and p_c = 0.999. Figure 9 shows how an increase in the number of alternate modules causes
an increase in MTTF and N. Further, it is interesting to note that the greatest benefit of
increasing the number of alternate modules is between 1 and 5. Beyond five alternate modules,
the additional benefit is quite small.

Figure 9: MTTF and N for the RB as a function of the number of modules.
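
For the basic model, N and the MTTF can also be estimated without solving the CTMC numerically. The Python sketch below is an independent back-of-the-envelope calculation, not the authors' SRN solution: it assumes that successive inputs are processed independently, that recovery takes no time (immediate transitions), and that the failure of the last module leads directly to RB failure. Under these assumptions the number of inputs started before failure is geometric with mean 1/q, where q is the per-input failure probability, and Wald's identity gives MTTF = (mean time per input) / q. The function name and signature are hypothetical.

```python
def rb_measures(num_modules, rate_primary=1.0, rate_ratio=0.75, rate_at=100.0,
                p_i=0.1, p_e=0.01, p_f=0.01, p_c=0.999):
    """Return (q, N, MTTF) for an RB with num_modules modules (primary included).

    Rates are per minute. Defaults are the parameter values quoted in the text.
    """
    # The AT rejects a module's output on a false alarm or on a detected error.
    reject = (1.0 - p_i) * p_f + p_i * (1.0 - p_e)
    reach = 1.0        # probability that the current module is executed at all
    q = 0.0            # probability that one input causes RB failure
    cycle = 0.0        # mean time spent on one input, in minutes
    rate = rate_primary
    for i in range(num_modules):
        cycle += reach * (1.0 / rate + 1.0 / rate_at)
        if i == num_modules - 1:
            q += reach * reject                   # last module: rejection is fatal
        else:
            q += reach * reject * (1.0 - p_c)     # recovery itself fails
            reach *= reject * p_c                 # recovery succeeds, try next module
            rate *= rate_ratio
    n_inputs = 1.0 / q
    return q, n_inputs, cycle * n_inputs


if __name__ == "__main__":
    for k in range(1, 8):
        q, n, mttf = rb_measures(k)
        print(k, round(n, 1), round(mttf, 1))
```

Running the loop shows the same qualitative trend as the text: each added module increases N and the MTTF, but the gain shrinks quickly because the term reject × (1 − p_c), the chance that a recovery fails, puts a floor under q.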


The distribution of the time to failure for the RB with 1, 2, and 3 alternate modules is
plotted in Figure 10. From the figure we can see that the probability that the RB has failed
by any given time θ decreases as the number of alternate modules increases. The number
of inputs processed by time θ, N(θ), for the RB with 1, 2, and 3 alternate modules is shown
in Figure 11. The value of N(θ) levels off beyond θ = 400 min for the system with 1 alternate,
since the RB is very likely to have failed by that time.

Figure 10: Failure probability for the RB as a function of time.

Figure 11: Number of inputs processed by the RB as a function of time.

4.2  Extensions 

4.2.1  Clustering  in  the  input  data  stream 

The failure points in the input space for the RB tend to occur in clusters [2]. The sequence of
input values to the RB tends to change slowly with time; thus, given a failure of the primary
module for a given input, there is a greater likelihood of it failing for subsequent inputs. This
clustering behavior in the input data stream should be taken into account. Csenki [9] considers
a discrete time Markov model of a RB with failure clustering. He assumes that the system
has a primary module and a single alternate. Given that the primary module has failed for a
particular input, the number of subsequent inputs for which the module fails is assumed to be
a random variable. The length of this additional sequence is, however, upper-bounded by a
fixed value σ. Thus, we can define the probabilities p_i = Pr{length of the additional sequence = i},
where 0 ≤ i ≤ σ.


Figure 12: SRN model for a RB with clustered failures.

A SRN model for the RB with clustered failures and m = 1 is shown in Figure 12. We
assume that the system has one primary module and a single alternate. The structure of
this SRN is similar to that of the original SRN in Figure 8. The additional places Pena and
Pent, together with the transitions Tc_0, ..., Tc_σ, model the clustering in the input space.

Transitions Tc_0, ..., Tc_σ correspond to the events where the size of the input cluster is
0, ..., σ, respectively. When the first datum of the input sequence that causes a clustered
failure is encountered, it causes a failure of the primary module. Thus, the immediate transition
Te_0 fires, depositing a token in place Pena. Then the immediate transitions Tc_0, ..., Tc_σ become
enabled. The weights of these transitions are p_0, ..., p_σ, respectively. Whenever
transition Tc_i fires, representing the fact that the primary module will fail for the next i
(0 ≤ i ≤ σ) successive inputs, i tokens are deposited in place Pent, due to the multiple output
arcs from transition Tc_i to place Pent. At this point, transitions Tne_0, Tc_1, ..., Tc_σ are
disabled by the inhibitor arcs from Pent. Thereafter, for the next i times that Tm_0 fires,
transition Te_0 will fire, depositing a token in Pena. Transition Tc_0 is now used to remove a
token from Pent and to start the usual RB sequence corresponding to the case where the first
module generates an erroneous output (token in Pate_0). This firing sequence continues until
Pent is empty. To achieve this behavior, the input arc from Pent to Tc_0 has multiplicity 1 if
#(Pent) ≥ 1 and 0 otherwise.

Figure 13: MTTF for the RB with clustered failures as a function of σ.

The MTTF as a function of σ is plotted in Figure 13 (the case with σ = 0 corresponds to
the RB with no clustered failures). Two different distributions are considered for p_i: uniform
and truncated geometric. For the uniform distribution, p_i = 1/(σ + 1) for all i, 0 ≤ i ≤ σ. In
this case, given that a failure has occurred, the probability that the next i inputs also result
in a failure of the primary is the same for all i (a pessimistic assumption). For the truncated
geometric distribution, p_i = p(1 − p)^i / (1 − (1 − p)^(σ+1)) for all i, 0 ≤ i ≤ σ, where 0 < p < 1 (in
Figure 13, we set p = 0.5). The probability that the failure cluster has size i tapers off as
i increases. This is more realistic than the uniform distribution. Clustered failures have a
negative impact on the reliability of the recovery block. The effect is larger with the uniform
distribution than with the truncated geometric distribution.
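
The two cluster-size distributions are easy to tabulate. The Python sketch below uses hypothetical helper names; it implements the two formulas for p_i given above and computes the mean cluster size, a summary quantity that the paper does not report but that helps compare the two cases.

```python
def uniform_cluster_pmf(sigma):
    """p_i = 1 / (sigma + 1) for i = 0, ..., sigma."""
    return [1.0 / (sigma + 1)] * (sigma + 1)


def truncated_geometric_pmf(sigma, p=0.5):
    """p_i = p (1 - p)^i / (1 - (1 - p)^(sigma + 1)) for i = 0, ..., sigma."""
    norm = 1.0 - (1.0 - p) ** (sigma + 1)
    return [p * (1.0 - p) ** i / norm for i in range(sigma + 1)]


def mean_cluster_size(pmf):
    """Expected number of additional failing inputs in a cluster."""
    return sum(i * p_i for i, p_i in enumerate(pmf))


if __name__ == "__main__":
    for sigma in (0, 2, 4, 8):
        u = uniform_cluster_pmf(sigma)
        g = truncated_geometric_pmf(sigma)
        print(sigma, mean_cluster_size(u), mean_cluster_size(g))
```

The mean cluster size is σ/2 for the uniform case, whereas for the truncated geometric case with p = 0.5 it stays below 1 for every σ. This is consistent with the larger effect reported for the uniform distribution.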

4.2.2  Arrivals  of  requests 

So far, we have assumed that when the RB completes processing an input, another input value
is already waiting to be processed. We are thus assuming 100% utilization of the recovery
block. In reality, the input values usually arrive at random and are processed when the RB
is available. To incorporate this effect into the model, we assume a Poisson input arrival
process with rate λ_{Tarr}. We also assume that the RB has a finite buffer of size N; at any
time, no more than N inputs can be waiting for processing, including the one being processed.
This is implemented by adding two places, Penv and Pbuf, and a transition Tarr, resulting in
Figure 14. The firing of transition Tarr represents the arrival of an input datum for processing.
Whenever an input value successfully completes execution or escapes error detection, a token
is added back into place Penv, thus freeing up a buffer.

Since the RB cannot fail while it is not being used, taking into account the input arrival
process increases the time to failure. This is reflected in Figure 15, where the MTTF is
plotted as a function of the arrival rate λ_{Tarr} for the RB with 1, 2, and 3 alternate modules,
and N = 10. When λ_{Tarr} is very small, we notice that the MTTF is large. This is because
there is a greater probability of the RB being unused for longer periods of time. As λ_{Tarr}
increases, the MTTF approaches that of the basic system considered in the earlier section;
in fact, λ_{Tarr} = ∞ corresponds to this system. This is understandable since, when λ_{Tarr}
increases, there is a greater chance of finding an input datum waiting for processing when the
RB completes processing an earlier input value.

4.3  Other  extensions 

The software recovery block is executed on some form of hardware platform. In the earlier
sections, we have implicitly assumed that the processor(s) on which the RB is being executed is
inherently fault free. In reality, hardware is subject to failures and can sometimes be repaired.
Hence, any realistic model should take into account the behavior and characteristics of the
underlying hardware such as processors and memory limitations. It is easy to extend our models
to allow for the failure/repair behavior of the processor(s) or other hardware components. This
will then allow us to carry out the combined evaluation of hardware and software.

Figure 14: SRN model for a RB with input arrivals.

Figure 15: MTTF for a RB with input arrivals as a function of λ_{Tarr} (horizontal axis: arrival
rate λ_{Tarr}, from 1.0e-03 to 1.0e+03).

5  Conclusions 

In this paper, we presented two software modeling applications where SRNs can be effectively
used to gain insight into a problem. The first application considers a producer-consumer
tasking system with an intermediate buffer task, and studies how the performance is affected
by different selection policies when multiple tasks are ready to synchronize.

The  second  application  studies  the  reliability  of  a  recovery  block  scheme.  The  initial  model 
is  incrementally  augmented  by  considering  the  possibility  of  clustered  failures  or  by  taking  into 
account  the  effective  arrival  rate  of  inputs  to  be  processed  by  the  system. 

In either model, each quantity to be computed is defined in terms of either the expected
value of a reward rate in steady-state or at a given time θ, or as the expected value of the
accumulated reward until absorption or until a given time θ. This allows extreme flexibility
while maintaining a rigorous formalization of these quantities.

The  Network  Synthesis  System

Background 

Modern  weapon  systems  have  extremely  complex  control,  computational,  and  information  processing 
requirements.  When  implemented,  they  include  tens,  or  even  hundreds,  of  computers  tied  together  in 
networks  with  various  degrees  of  intimacy.  The  DoD  is  faced  with  the  problem  of  declining  budgets,
and  the  attendant  reduction  in  both  numbers  of  designers  and  numbers  of  engineering  organizations, 
at  the  same  time  that  mission  complexity  and  system  complexity  are  increasing  greatly.  This  implies 
that  the  means  must  be  found  to  greatly  improve  the  productivity  of  the  smaller  cadre  of  designers 
and  to  create  increasingly  more  complex  and  high  quality  systems  with  limited  budgets. 

The  Network  Synthesis  System  is  an  integrated  design  automation  system  for  performing  Network 
Synthesis  for  specific  applications.  The  primary  objective  of  the  Network  Synthesis  System  is  to 
provide  a  smart  designer  with  the  tools  that  will  enable  him  to  mostly  automatically  synthesize  a 
network.  The  System  would  provide  facilities  for  synthesizing  and  evaluating  numerous  alternatives. 
It  would  enable  the  automated  synthesis  of  network  solutions  based  totally  on  the  requirements  (e.g., 
algorithm  X  must  execute  in  90  msec)  and  constraints  (e.g.,  the  resulting  hardware  must  weigh  no 
more  than  5  lbs.)  imposed  by  the  designer.  By  deriving  solutions  from  requirements  and  constraints, 
the  System  will  ensure  that  solutions  meet  specifications  and  enable  the  quantification  of  the  impact 
of  specification  changes.  Other  objectives  of  the  Network  Synthesis  System  are  involved  with  such 
issues  as  Design  Optimization  Strategies,  Validated  Hardware  and  Software  Parts  Libraries,  and  User 
Friendly  Interfaces. 

A  successful  Network  Synthesis  System  would  be  expected  to  reduce  design  time  by  a  factor  of  10  to
100.  It  would  enable  the  practical  development  of  proof-of-concept  and  pre-production  prototypes  in
an  almost  wholly  automated  way.  It  would  reduce  the  risk  and  cost  in  major  system  development. 
One  could  synthesize  and  evaluate  a  complex  network  in  days  with  such  a  tool;  a  process  that  now 
takes  months  and  years. 

The  NSS  will  be  developed  by  integrating  together  two  existing  tool  sets  and  by  providing  additional 
tools  to  support  the  network  design  process. 

The  two  existing  tool  sets  are: 

1.  Processing  Graph  Methodology  (PGM) 

The  PGM  is  a  tool  set  developed  by  the  Naval  Research  Laboratory  on  the  AN/UYS-2  Program.
PGM  tools  enable  the  design  of  an  application  system  to  be  done  at  a  high  level,  expressed  in 
graph  notation,  and  then  be  translated  into  code  strings  that  are  executable  in  the  functional 
elements  of  the  system.  It  includes  a  set  of  network  level  simulation  tools,  called  PGSE,  which 
provide  designers  with  the  means  for  evaluating  the  performance  of  their  designs.  It  also  includes 
comprehensive  graph  building  tools. 

2.  Integrated  Design  Automation  System  (IDAS) 




IDAS  is  a  tool  set  developed  by  JRS  mostly  under  the  sponsorship  of  the  Tri-Services  VHSIC
Program.  The  JRS  tools  include  an  automatically  retargetable  Ada  Compiler,  an  Ada  Behavioral 
Specification  to  VHDL  Structural  Description  Synthesizer,  and  various  tools  to  assist  in  adapting 
Ada  to  the  special  characteristics  of  an  arbitrary  embedded  computer  and  supporting  application 
programmers  in  achieving  effective  results.  The  JRS  tools  are  unique,  particularly  as  they  relate 
to  Ada  and  VHDL.  Various  Simulators  are  also  included. 

Some  of  the  new  tools  to  be  developed  include: 

•  User  Interface  to  control  the  design  process 

•  Strategies  and  algorithms  for  the  various  design  optimizations. 

•  Library  Building  and  Management  Facilities  for  collecting,  storing,  and  accessing  relevant  data. 


System  Summary 

The  Network  Synthesis  System  (NSS)  will  provide  a  Designer  with  the  tools  and  methodology  needed  to 
synthesize  network  level  systems  consisting  of  computers  of  various  types  and  capabilities  (e.g.,  signal 
processors),  memories  of  various  types  and  sizes,  input/output  elements  of  various  types  (e.g.,  sensors,
displays,  controls),  and  the  communication  elements  of  various  types  needed  to  link  the  pieces  of  a 
network  together  in  an  effective  manner.  With  the  NSS,  a  Designer  will  be  able  to  rapidly  prototype 
(in  model  form)  a  network  and  evaluate  it  objectively  against  its  specifications  (e.g.,  performance, 
functionality)  and  constraints  (e.g.,  limits  on  power,  size,  reliability). 

The  NSS  will  provide  an  integrated  collection  of  tools  to  support  a  comprehensive  design  methodology.
It  will  include: 

•  Network  Synthesis  Tools 

•  Processor  Synthesis  Tools 

•  Concurrent  Hardware/Software  Design 

•  Ada  and  VHDL  Languages 

•  Specification  First  Design 

•  Reusable  Part  Libraries  for  both  Software  (Ada)  and  Hardware  (VHDL) 

•  Simulation  and  Evaluation  Tools  covering  the  Design  Hierarchy  from  Networks  to  Components 

•  Estimating  Tools  incorporating  Rules  of  Thumb  and  Engineering  Judgment 

•  Physical  Package  Modelling,  Partitioning,  and  Assignment 

•  Integration with Lower Level Tools (e.g., Silicon Compilers)


Network Synthesis from Specifications and Constraints

The  network  synthesis  process  is  driven  by  the  network  requirement  specifications.  This  means  that 
the  NSS  will  take,  as  input,  a  behavioral  specification  of  the  network  to  be  synthesized  and  will  be 
capable  of  automatically  generating,  as  output,  a  structural  description  of  a  network  that  satisfies  the 
behavioral  specification. 

The  network  synthesis  process  is  controlled  by  the  Designer  via  the  imposition  of  constraints  on  the 
solution space available to the NSS in transforming the behavioral specifications into structural
descriptions. “Constraints” are to be viewed as budgets or limits. The network of interest may, for
example, be budgeted to no more than 100 watts or held to at least a 2000-hour MTBF; it may be
limited to three computers or to only computers for which validated Ada compilers exist. These
“constraints” will directly affect the solution that is generated.

Each  particular  set  of  behavioral  specifications  and  constraints  imposed  on  the  solution  will,  in  general, 
lead to a different solution. The comparison of the different solutions, in terms of any parameters of
interest, is called a tradeoff. Designers perform “tradeoffs” to measure the sensitivity of some solution
parameters to changes in one or more of the behavioral specifications and constraints. They perform
“what if?” experiments. Designers will vary the specifications of the required behavior (e.g., change
the image processing algorithm) or modify one or more constraints (e.g., raise the power budget to 200
watts), then regenerate the solution, and then compare the solution parameters to those of other,
previously generated, solutions. Designers are generally looking to find the best solution that satisfies
all of the requirements; a 50 watt solution is better than a 100 watt solution, even if the budget is 200
watts.
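
The tradeoff loop described above can be sketched in a few lines of Python. This is an illustration only: the `Solution` record, its fields, the candidate values, and the choice of lowest power as the figure of merit are all hypothetical and are not part of the NSS. The sketch also assumes that the MTBF figure acts as a minimum while power and computer count act as maxima.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Solution:
    """Hypothetical summary of one synthesized network (not an NSS data type)."""
    name: str
    power_watts: float
    mtbf_hours: float
    computers: int


def feasible(s, max_power, min_mtbf, max_computers):
    """True if the solution stays within every budget or limit."""
    return (s.power_watts <= max_power
            and s.mtbf_hours >= min_mtbf
            and s.computers <= max_computers)


def best_solution(candidates, max_power, min_mtbf, max_computers):
    """Among feasible candidates, prefer the lowest power (one possible figure of merit)."""
    ok = [s for s in candidates if feasible(s, max_power, min_mtbf, max_computers)]
    return min(ok, key=lambda s: s.power_watts) if ok else None


candidates = [
    Solution("A", power_watts=100.0, mtbf_hours=2500.0, computers=3),
    Solution("B", power_watts=50.0, mtbf_hours=2200.0, computers=2),
    Solution("C", power_watts=40.0, mtbf_hours=1500.0, computers=2),  # misses the MTBF limit
]

# "What if?" experiment: raise the power budget from 100 W to 200 W and compare.
for budget in (100.0, 200.0):
    pick = best_solution(candidates, budget, min_mtbf=2000.0, max_computers=3)
    print(budget, pick.name if pick else None)
```

In this made-up data set the 50 watt candidate is selected under both budgets, which mirrors the remark that a 50 watt solution is preferred even when 200 watts are allowed.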

Behavioral  specifications  or  constraints  need  not  be  complete  or  finished  in  some  sense.  The  NSS  will 
generate  solutions  based  on  the  inputs  presented  and,  in  fact,  will  not  know,  or  be  concerned  with, 
whether or not the inputs are correct or complete. This capability facilitates “what if?” experiments. It is
necessary  for  a  methodology  wherein  a  design  evolves  as  more  knowledge  of  the  requirements  is  gained 
or  a  design  must  respond  to  an  abrupt  change  in  specifications  or  constraints.  It  is  also  necessary 
for  a  methodology  that  supports  the  concept  of  providing  reasonable  estimates  of  the  implications  of 
design  decisions;  that  is,  the  System  will  be  able  to  objectively  estimate  the  difference  in  the  power 
requirement, for example, of a design implemented in GaAs versus CMOS.

The  solution  to  a  network  synthesis  problem  is  a  structural  network.  The  structural  network  consists 
of  a  collection  of  hardware  nodes,  of  a  variety  of  types  (e.g.,  computers,  sensors),  interconnected 
with  one  another  in  some  manner.  A  network  solution  is  obtained  when  all  of  the  nodes  and  all 
of  the  interconnections  are  realized;  realized  means  that  hardware  has  been  selected  or  designed  to 
implement  the  node  or  interconnection;  the  solution  obtained  must  satisfy  the  network  specifications 
and  constraints. 

Processor  Synthesis  from  Specifications  and  Constraints 

In synthesizing a network, a solution will be sought that utilizes existing hardware; that is, hardware
that is modelled and contained in a reusable parts library and for which a physical implementation
may exist.


At  times,  it  will  be  found  that  the  requirements  are  such  that  no  solution  can  be  found  when  restricted 
to  the  use  of  existing  hardware.  Then,  the  requirement  to  synthesize  a  new  part  (e.g.,  a  computer) 
might  arise.  In  a  similar  vein,  it  may  simply  be  of  interest  to  explore  the  design  space  and  generate 
new  library  entries  for  later  network  problems.  In  either  case,  whether  the  part  synthesis  requirement 
is  generated  by  a  network  synthesis  problem  or  by  an  off-line  library  building/validating  process,  a 
part  level  synthesis  tool  is  needed  to  satisfy  the  requirement. 

The NSS provides a Processor Synthesis tool set that is driven by behavioral specifications and
constraints. Behavioral specifications are expressed as Ada Programs. Processor synthesis results in a
solution, which is a computer description expressed in VHDL; that is, it maps Ada specifications to
VHDL descriptions.

Processors  are  synthesized  from  parts  in  a  library,  exactly  as  is  done  for  network  synthesis;  the  parts, 
at  this  design  level,  are  things  like  ALUs,  Memories,  Registers,  etc. 

Processor synthesis includes numerous optimization elements for effecting the generation of good
quality designs, not simply functionally correct ones.


Concurrent  Hardware/Software  Design 

The  NSS  supports  concurrent  hardware/software  design.  It  provides  three  main  capabilities  in  this 
regard: 

•  Provides  facilities  for  performing  comprehensive  hardware/software  tradeoffs  and  for  measuring 
the  impacts  of  decisions  objectively  and  quantitatively. 

•  Provides an automatically retargetable Ada Compiler System that is retargeted from a
description of a computer expressed in VHDL.

•  Provides comprehensive analysis and tracing facilities, coupled with a sophisticated User
interface, to allow Users to see the detailed relationship between the hardware and software. For
example,  one  can  see  detailed  hardware  utilization  data  associated  with  the  execution  of  any 
selected  program;  this  would  be  of  great  interest  for  real  time  evaluations  and  for  fault  tolerance 
analyses. 

Together, these capabilities allow a Designer to proceed with hardware and software design and
implementation concurrently. They allow for evaluation of designs almost instantaneously (i.e., within
hours) from the time that versions are identified. They allow detailed development to start early and
proceed with confidence. Integration and testing are started immediately, not at the end of the project.


Standards - Ada and VHDL


The NSS utilizes Standards wherever practical in its design and development; it is anchored to the
use of Ada and VHDL.

Ada is utilized within the NSS in the conventional manner as an application programming language.
It  will  be  used  as  the  application  programming  language  for  all  types  of  applications  and  for  all  types 
of  computers.  Ada  is  also  used  as  the  behavioral  specification  language  for  a  computer  in  a  network; 
normal,  compilable  and  executable,  Ada  programs  will  be  used  for  this  purpose. 

VHDL  is  used  in  the  NSS  as  the  language  for  describing  hardware,  at  any  level  in  the  system  hierarchy 
from  network  to  gate.  The  NSS  will  contain  and/or  generate  VHDL  models  of  networks,  computers, 
and  lower  level  components. 


Specification  First  Design 

From  a  language  point  of  view,  synthesis  at  any  design  level  is  a  process  that  ultimately  transforms  a 
behavioral  specification  into  a  structural  description.  This  implies  that  one  should  first  determine  what 
an  entity  is  to  do  (i.e.,  its  behavioral  specification)  and  then  generate  and  examine  alternative  designs 
(i.e.,  solutions)  that  do  it;  this  is  an  eminently  reasonable  idea.  The  NSS  supports  this  methodology 
very  strongly. 

The  NSS  uses  Graphical  Representations  and  Ada  Programs  as  the  primary  mechanisms  for  expressing 
behavioral  specifications. 

At  the  network  level,  the  NSS  utilizes  data  flow  graphs  to  depict  the  behavior  and  to  express  the 
concurrency  possible  in  the  behavior.  Nodes  in  a  graph  are  reducible  to  a  “primitive”  level  at  which 
the  behavior  of  the  node  is  expressible  as  an  Ada  Program.  The  network  synthesis  process  transforms 
an  arbitrary  behavioral  network,  containing  an  arbitrary  number  of  primitive  nodes,  into  a  structural 
network,  having  a  specified  number  of  hardware  nodes,  of  specified  types,  interconnected  in  a  specified 
manner. Normally, a network will exhibit concurrency of behavior on a “primitive” node level.

At  the  processor  level,  the  NSS  utilizes  Ada  Programs  to  depict  the  behavior  required  and  to  express  the 
concurrency  possible  in  the  behavior  (as  discernible  in  the  data  dependency  graphs  for  the  program). 
The Ada Programs used may have originated as primitive node specifications on the network level; if
so, the connection from network level to processor level is seamless.

The  primitive  behaviors  at  the  processor  level  are  expressed  as  arithmetic  or  logical  operations.  These 
behaviors  are  expressible  as  logic  equations,  truth  tables,  or  similar  representations,  which  can  then 
be  passed  to  logic  synthesizers  for  synthesis  of  elements  like  register  files  and  RALUs. 


Reusable  Parts  -  Hardware  and  Software 

The NSS incorporates, and relies heavily on, the concept of reusable parts. It includes libraries of
hardware elements (e.g., processors, displays, ALUs, memories) and software elements (i.e., algorithms
expressed in Ada). It also includes “reusable data” elements that represent the relationships between
parts in reusable libraries; for example, reusable data includes the execution time of an algorithm
(from the reusable Ada library) on a processor (from the reusable computer library in VHDL) as a
data element that is maintained.

Synthesis on a network level involves reusable parts very heavily. Primitive behavioral nodes are from
the Ada library. Selectable computers are obtained from the VHDL library when constructing the
structural graph. Relational data elements are used in deciding how to assign behavioral nodes to
structural elements.
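
One way to picture how relational data elements could drive assignment is the sketch below. It is only an assumption-laden illustration: the table contents, the algorithm and processor names, and the "pick the fastest recorded processor" rule are hypothetical, and the text does not state which assignment rule the NSS actually applies.

```python
# Hypothetical "reusable data": execution time in milliseconds of each
# library algorithm (Ada) on each library processor (VHDL model).
EXEC_TIME_MS = {
    ("fft", "dsp_a"): 2.0,
    ("fft", "gp_b"): 9.0,
    ("track_filter", "dsp_a"): 6.0,
    ("track_filter", "gp_b"): 4.0,
}


def assign(nodes, processors, exec_time):
    """Map each behavioral node to the processor with the smallest recorded time."""
    mapping = {}
    for node in nodes:
        known = [p for p in processors if (node, p) in exec_time]
        if not known:
            raise KeyError(f"no timing data recorded for {node!r}")
        mapping[node] = min(known, key=lambda p: exec_time[(node, p)])
    return mapping


print(assign(["fft", "track_filter"], ["dsp_a", "gp_b"], EXEC_TIME_MS))
```

A realistic assignment would also have to weigh processor loading, communication cost, and the imposed constraints; the sketch ignores all of these.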

Reusable Part Libraries are conceptually very important for two other reasons:


•  Provides the mechanism for enforcing “validation” of parts and the use of validated parts.

•  Facilitates the synthesizing of parts, for all design levels, in a manner independent of the technical
and schedule requirements of a project. Algorithm designers, for example, could do their thing
and deposit the results into a library off line from any specific project.


Simulation  and  Evaluation  Tools 

The NSS contains the capability to simulate and evaluate designs from the network to the component
level. It contains the following:

•  Network  Level  Behavioral  Simulation 

•  Network  Level  Structural  Simulation 

•  Processor  Level  Software  Simulation  (VHDL) 

•  Processor  RTL  Level  Simulation  (VHDL) 

•  Component  Level  Simulation  (VHDL) 

These  tools  are  tightly  integrated  into  the  design  methodology  and  are  used  to  generate  data  upon 
which design selection and optimization decisions are made.


Estimating  Tools 

The  NSS  includes  the  capability  of  rapidly  generating  estimates.  When  performing  synthesis  from  a 
high  level,  network  or  processor,  a  designer  will  be  interested  in  estimates  of  various  design  parameter 
values  that  would  result  from  alternative  constraints  that  might  be  imposed.  For  example,  the  designer 
would like to know the estimated MTBF for a given design implemented in GaAs and Packaging Scheme
1 versus CMOS and Packaging Scheme 3.

The NSS provides rapid estimate generation based on measures of complexity and “good engineering
judgment” or “rules of thumb”. The capability is data driven, which means that any organization or
individual may establish unique estimating factors for any combination of constraints of interest.


Physical  Package  Modelling,  Partitioning,  and  Assignment 

The NSS provides the facilities for modelling a set of physical packages (e.g., chassis, board, carrier,
hybrid, IC) in terms of their capacities. It also provides the capability of partitioning an interconnected
set  of  hardware  components,  at  any  structural  level,  into  subsets  that  match  the  capacities  of  the 
packages.  Together,  these  capabilities  provide  the  ability  to  do  multilevel  assignment. 


Integration  with  Lower  Level  Tools 

The  NSS  utilizes  standard  languages  and  formats  to  facilitate  its  interconnection  to  other  tools.  It  has 
been  interfaced  to  VHDL  compiler  systems,  a  silicon  compiler  system,  various  simulation  tools,  and 
various software assemblers/linkers/loaders. It is soon to be interfaced to a MOSIS-supported silicon
compiler.


Abstract

Reliability analysis of various disk array architectures (different levels of RAID) is performed.
The dependence of reliability and mean time to data loss on different parameters
of a disk array and support hardware components needed for correct functioning of disk
array is characterized. A study of these characteristics reveals the implications of several
design issues of a disk array on its reliability. Issues like scalability of disk arrays, imperfect
coverage of disk failures, cold versus hot disk spares, predictive disk failures, reliability
of disk arrays for mission-critical computer systems, serial versus orthogonal placement
of support hardware with respect to disk groups, and levels of hardware redundancy are
studied.

1  Introduction 

To achieve high computer system performance, the performance of the system's components must
increase in proportion to one another. Unfortunately, I/O storage systems have not been able
to  match  the  high  performance  of  the  CPUs  and  memories.  To  bridge  this  performance  gap, 
high-performance  disk  array  architectures  have  been  proposed.  Given  similar  performance 
and  cost  per  megabyte,  higher  performance  can  be  obtained  by  using  an  array  of  smaller 
disks  compared  to  a  single  large  disk.  More  arms  can  be  provided  and  requests  that  access 
only  a  single  disk  can  be  serviced  independently.  Large  requests  that  need  to  access  several 
disks  can  be  serviced  much  faster  by  performing  data-transfer  in  parallel. 

* This work was supported in part by the National Science Foundation under Grant CCR-9108114 and
by the Office of Naval Research under Grant N00014-91-J-4162.

However, an array of disks is more fault-prone than a single large drive. Assuming the
MTTF  (mean  time  to  failure)  of  a  large  disk  drive  to  be  the  same  as  the  MTTF  of  a  small 
disk,  an  array  of  a  hundred  disks  would  have  an  MTTF  that  is  one  hundredth  of  that  of  a 
large  drive.  Even  if  the  assumption  made  above  is  not  correct,  it  is  not  hard  to  see  that  the 
reliability  of  a  disk  array  system  can  be  much  less  than  required.  Failure  of  a  disk  may  result 
in  loss  or  corruption  of  data.  In  many  applications,  data  is  accumulated  through  years  of 
research  efforts  and  extensive  experimentation.  Loss  of  critical  data  is  clearly  unacceptable 
as  it  may  lead  to  financial  loss  or  loss  of  life.  Thus,  the  demand  is  to  design  cost-effective 
disk  systems  which  can  not  only  deliver  high-performance  but  also  provide  high  reliability. 
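
The hundredfold reduction follows because, for independent disks with exponentially distributed lifetimes and no redundancy, the first failure among n disks arrives at n times the single-disk rate. A small sketch of that arithmetic, in which the 30,000-hour disk MTTF is an assumed figure chosen only for illustration:

```python
def array_mttf(disk_mttf_hours, n_disks):
    """MTTF of n independent disks with exponential lifetimes and no redundancy."""
    return disk_mttf_hours / n_disks


# Assumed single-disk MTTF of 30,000 hours; an array of 100 such disks.
print(array_mttf(30_000.0, 100))
```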

The  purpose  of  this  paper  is  to  quantify  the  reliability  and  mean  time  to  data  loss  of 
disk  array  architectures  and  provide  answers  to  important  questions  arising  in  the  design  of 
a  disk  array  by  means  of  analytic  models.  Gibson  [3],  Schulze  et  al  [15],  and  Patterson  et  al 
[12]  have  analyzed  reliability  of  different  RAID  architectures  in  terms  of  mean  time  to  data 
loss  (MTDL).  Bitton  and  Gray  [1]  have  analyzed  MTTF  for  mirrored  disks.  However,  these 
approaches  use  simple  approximations.  The  most  comprehensive  effort  to  analyze  RAID 
reliability  in  terms  of  different  parameters  is  the  work  by  Gibson  and  Patterson  [4].  They 
compute  MTDL  for  disk  array  models  based  on  certain  approximations.  For  these  models, 
they  assume  that  time  to  failure  of  a  group  of  disks  is  exponentially  distributed.  Based  on 
this  assumption,  they  compute  the  reliability  of  disk  arrays  using  the  approximate  value  of 
MTDL.  Usefulness  of  these  approximations  is  that  MTDL  and  reliability  can  be  expressed 
in  closed  form. 

It  is  easy  to  see  that  time  to  failure  of  a  group  of  disks  is  not  exponentially  distributed 
even  when  individual  disk  failure  times  are.  Moreover,  the  emphasis  on  MTDL  as  a  metric 
to  evaluate  the  reliability  of  disk  arrays  could  be  misleading.  The  variance  of  time  to  data 
loss  for  a  typical  disk  array  is  very  large  and  the  actual  time  to  data  loss  in  practice  may 
differ significantly from the MTDL. In this paper, we develop hierarchical reliability models
for RAID architectures and perform exact analysis of these models. Use of hierarchical
modeling provides a better understanding of the architecture being modeled and keeps the
state space of models small. To make the models realistic, we take into account several
factors which have hitherto not been considered, mainly, imperfect coverage of disk failures,
disk  failure  prediction,  and  the  type  of  spares  (cold  or  hot).  Even  for  slightly  advanced 
models,  closed  form  solutions  become  very  messy  or  impossible.  Therefore  we  provide 
numerical  solutions.  We  emphasize  reliability  more  than  the  MTDL  as  a  metric  to  evaluate 
different  RAID  architectures. 

Our  analysis  not  only  provides  a  comparison  between  different  disk  array  architectures, 
but also provides answers to specific questions about disk arrays in general, such as: 1)
How  reliable  should  each  individual  disk  be?  2)  Are  disk  arrays  reliable  enough  for  mission- 
critical  systems?  3)  Should  the  disk  spares  be  kept  hot  or  cold?  4)  How  much  redundancy 
in  disk  spares  is  needed?  5)  Just  how  small  should  the  data  reconstruction  time  be?  6) 
Should  hardware  redundancy  be  scaled  as  the  dimensions  of  disk  arrays  are  scaled?  7)  How 
much  better  is  the  orthogonal  placement  of  support  hardware  than  serial  placement?  and 

The  rest  of  this  paper  is  organized  as  follows.  In  Section  2,  we  describe  a  general 
hierarchical  reliability  model  for  a  general  disk  array  architecture.  This  model  fits  several 
of  the  disk  array  architectures  which  we  briefly  describe.  In  Section  3,  we  extend  this  model 
for  the  architectures  which  rely  on  disk  controllers  for  disk  failure  detection  and  location.  In 
Section  4,  we  develop  reliability  models  for  different  RAID  architectures  with  finite  number 
of  cold  and  hot  disk  spares.  In  Section  5,  we  include  support  hardware  components  into 
our model. We develop models for two different hardware organizations for RAID: serial
and  orthogonal  placement  of  support  hardware  with  respect  to  disk  groups.  In  Section  6, 
numerical  results  obtained  from  the  solution  of  these  models  are  presented  and  discussed. 
Finally,  we  present  our  conclusions  based  on  these  results  in  Section  7. 

2  Fault-Tolerant  Disk  Array  Architectures 

Several fault-tolerant disk array architectures have been introduced by different researchers
using varying degrees of hardware redundancy [1, 7, 8, 9, 11, 14]. Patterson et al [11] coined
the term RAID (Redundant Array of Inexpensive Disks) for such disk array systems with
redundancy. They unified the existing disk system architectures as different levels of RAID
(levels 1, 2, 3, 4) and proposed a new high-performance disk system architecture (RAID level
5).


We  now  introduce  the  terminology  and  state  the  assumptions  used  throughout  the  paper. 
The  disk  array  consists  of  N  groups  of  disks.  Each  group  has  D  data  disks  and  C  check 
disks. We assume that time to failure of each disk is exponentially distributed with mean $1/\lambda$
(MTTF). Failure of a single disk in a group is tolerable since lost data can be reconstructed.
The data reconstruction process consists of two steps: disk replacement and data construction
on the replaced disk. Disk replacement consists of bringing in a spare disk to replace the
failed disk. Data construction is accomplished using the data from the working disks and
parity information of the group. We initially assume that each group has a spare disk which
can be electronically switched in when a disk fails. We further assume that the spare disk
does not fail as long as it is not switched in.

However,  if  another  disk  in  the  same  group  fails  while  the  reconstruction  is  underway, 
then  data  is  lost  (can  not  be  reconstructed)  and  that  group  of  disks  is  considered  failed.  We 
assume  that  each  group  has  its  own  reconstruction  mechanism  independent  of  other  disks. 
Thus,  reconstruction  could  be  carried  out  for  more  than  one  group  at  the  same  time.  All 
the  disks  are  identical  (come  from  the  same  manufacturer).  Unless  otherwise  stated,  the 
data reconstruction time is assumed to be exponentially distributed with mean $1/\mu$. Failure
of  any  group  results  in  the  failure  of  disk  array  so  that  data  loss  in  any  group  constitutes 
the  failure  of  disk  array. 

We  define  data-reliability  as  the  probability  that  no  data  loss  occurs  until  time  t.  Thus, 
data-reliability  is  same  as  the  reliability  of  disk  array  and  it  is  expressed  as  a  function  of 
time  t.  Based  on  these  assumptions,  we  construct  a  two-level  hierarchical  reliability  model 
for  a  general  disk  array  architecture  as  described  in  [6].  The  reliability  of  the  disk  array  is 
modeled  by  a  reliability  block  diagram  (RBD)  shown  in  Figure  1.  This  upper  level  model 
has a series structure. Each block represents a group of disks. Until Section 5, we assume
that groups behave independently of each other. If $R_i(t)$ is the reliability of group $i$, then
reliability of the disk array is given by:

$$R_{DA}(t) = \prod_{i=1}^{N} R_i(t). \tag{1}$$

To compute the reliability of a group, we use a simple Markov model shown in Figure 2.

Figure 1: Reliability block diagram for RAID

[Figure 2 diagram not reproduced. Transition rates legible in the original: $(D + C)\lambda p$ from state 2 to state 1, $(D + C - 1)\lambda$ from state 1 to state 0, and $(D + C)\lambda(1 - p)$ from state 2 to state 0.]

Figure 2: Markov reliability model for a single group of disks (RAID-1,2,3,4,5)

In  state  2,  all  the  disks  in  a  group  are  operational.  After  one  of  the  disks  fails,  system 
state  changes  from  2  to  1  and  data  reconstruction  is  initiated.  However,  the  disk  array 
keeps functioning since data is available. If any other disk in the group fails before the
reconstruction  is  completed,  then  data  is  lost  and  the  disk  array  is  considered  failed.  State 
0  is  the  group  failed  state. 

Note  that  we  allow  imperfect  coverage  of  faults.  A  fault  in  a  disk  is  covered  with 
probability  p  and  not  covered  with  probability  1  -  p.  An  uncovered  fault  in  a  group  causes 
data  loss.  Bit  errors  that  are  not  detected,  not  corrected,  or  miscorrected  are  manifestations 
of  what  we  call  uncovered  faults.  These  faults  may  occur  due  to  some  fault  within  the 
error-correcting code (ECC) or if ECC is not properly invoked when an error is detected.
Moreover, ECC can correct only single bit errors. Occurrences of multiple bit errors (which
may happen due to some extraneous electric signal) are accounted for by imperfect coverage.
Failures in support hardware (e.g., failure of cooling equipment) may cause an unrecoverable
failure in a disk. Some disk array architectures rely on the array controller’s ability to
detect disk failures. Failure of the disk controller mechanism to detect a disk failure is also
considered  an  uncovered  failure.  Imperfect  coverage  also  accounts  for  catastrophic  failures 
due  to  extreme  environmental  conditions. 

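The reliability of any group $G_i$ is given by:

$$R_i(t) = A_1 e^{\beta_1 t} + A_2 e^{\beta_2 t}, \tag{2}$$

where

$$
\begin{aligned}
\beta_1 &= \frac{-\bigl(\mu + (2D + 2C - 1)\lambda\bigr) + \sqrt{\lambda^2 + \mu^2 + \lambda\mu\bigl((4D + 4C)p - 2\bigr)}}{2}, \\
\beta_2 &= \frac{-\bigl(\mu + (2D + 2C - 1)\lambda\bigr) - \sqrt{\lambda^2 + \mu^2 + \lambda\mu\bigl((4D + 4C)p - 2\bigr)}}{2}, \\
A_1 &= \frac{\bigl((D + C)(1 + p) - 1\bigr)\lambda + \mu + \beta_1}{\beta_1 - \beta_2}, \\
A_2 &= \frac{\bigl((D + C)(1 + p) - 1\bigr)\lambda + \mu + \beta_2}{\beta_2 - \beta_1}.
\end{aligned}
\tag{3}
$$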
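
The closed form can be evaluated numerically as in the following sketch. It assumes the group reliability has the two-exponential form $A_1 e^{\beta_1 t} + A_2 e^{\beta_2 t}$ with the coefficients defined above, and that all $N$ groups are identical so that the series product of Equation 1 reduces to an $N$-th power. The function names and the parameter values (rates per hour) are hypothetical.

```python
import math


def group_params(D, C, lam, mu, p):
    """Return (beta1, beta2, A1, A2) for one group of D data and C check disks."""
    n = D + C
    s = mu + (2 * n - 1) * lam
    root = math.sqrt(lam**2 + mu**2 + lam * mu * (4 * n * p - 2))
    beta1 = (-s + root) / 2
    beta2 = (-s - root) / 2
    k = (n * (1 + p) - 1) * lam + mu
    A1 = (k + beta1) / (beta1 - beta2)
    A2 = (k + beta2) / (beta2 - beta1)
    return beta1, beta2, A1, A2


def group_reliability(t, D, C, lam, mu, p):
    """Probability that a single group has lost no data by time t."""
    beta1, beta2, A1, A2 = group_params(D, C, lam, mu, p)
    return A1 * math.exp(beta1 * t) + A2 * math.exp(beta2 * t)


def array_reliability(t, N, D, C, lam, mu, p):
    """Series structure of N identical, independent groups."""
    return group_reliability(t, D, C, lam, mu, p) ** N


# Assumed values: 10 data disks + 1 check disk per group, disk MTTF 100,000 h,
# mean reconstruction time 1 h, coverage 0.99, 5 groups, one year of operation.
print(array_reliability(8760.0, N=5, D=10, C=1, lam=1e-5, mu=1.0, p=0.99))
```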


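The reliability of the disk array is given by Equation 1. Mean time to data loss for a group
of disks is:

$$\mathrm{MTDL}_i = \frac{\mu + \bigl((D + C)(1 + p) - 1\bigr)\lambda}{(D + C)\lambda\bigl(\mu(1 - p) + (D + C - 1)\lambda\bigr)}. \tag{4}$$

The mean time to data loss for the disk array is given by:

$$\mathrm{MTDL}_{DA} = \int_0^\infty R_i^N(t)\,dt = -\sum_{j=0}^{N} \binom{N}{j} \frac{A_1^{j} A_2^{N-j}}{\beta_1 j + \beta_2 (N - j)}. \tag{5}$$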
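
The two MTDL expressions can be cross-checked numerically. The sketch below assumes identical groups, a group reliability of the form $A_1 e^{\beta_1 t} + A_2 e^{\beta_2 t}$, and an array MTDL obtained by expanding the $N$-th power of that reliability with the binomial theorem and integrating term by term. The function names and parameter values are hypothetical.

```python
import math


def group_params(D, C, lam, mu, p):
    """Return (beta1, beta2, A1, A2) of the two-exponential group reliability."""
    n = D + C
    s = mu + (2 * n - 1) * lam
    root = math.sqrt(lam**2 + mu**2 + lam * mu * (4 * n * p - 2))
    beta1, beta2 = (-s + root) / 2, (-s - root) / 2
    k = (n * (1 + p) - 1) * lam + mu
    return beta1, beta2, (k + beta1) / (beta1 - beta2), (k + beta2) / (beta2 - beta1)


def group_mtdl(D, C, lam, mu, p):
    """Mean time to data loss of a single group (closed form)."""
    n = D + C
    return (mu + (n * (1 + p) - 1) * lam) / (n * lam * (mu * (1 - p) + (n - 1) * lam))


def array_mtdl(N, D, C, lam, mu, p):
    """Mean time to data loss of N identical, independent groups in series."""
    beta1, beta2, A1, A2 = group_params(D, C, lam, mu, p)
    total = 0.0
    for j in range(N + 1):
        coeff = math.comb(N, j) * A1**j * A2 ** (N - j)
        # Each term integrates exp((j*beta1 + (N-j)*beta2) * t) over [0, inf).
        total += coeff / -(beta1 * j + beta2 * (N - j))
    return total


# Assumed values: rates per hour, 10 data disks + 1 check disk per group.
print(group_mtdl(10, 1, 1e-5, 1.0, 0.99), array_mtdl(10, 10, 1, 1e-5, 1.0, 0.99))
```

With a single group the sum collapses to the group MTDL, which gives a quick consistency check on both formulas.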


2.1  RAID-1  (Mirrored  Disks) 

Mirroring  is  the  traditional  approach  to  improve  the  reliability  of  disk  systems.  Bitton 
and  Gray  [1]  introduced  the  concept  of  disk  shadowing  in  which  a  shadow  set  of  k  disks 
(i.e.,  k  identical  copies  of  same  data)  are  maintained.  This  set  can  support  k  reads  in 
parallel  assuming  parallel  datapaths  and  enough  disk  controllers  (thus  effectively  increasing 
the read rate by a factor of k). A write is performed in parallel over k disks (thereby
maintaining a write rate of single disk). We concern ourselves with the case where k = 2.
This  configuration  is  also  known  as  disk  duplexing  or  mirroring. 

If the storage capacity of the system requires N disks, then 2N disks are used in a
duplex system. This is the most expensive of the different RAID architectures. It is also
significantly different from the rest of RAID architectures. Each pair of mirrored disks
forms a group. If one of the disks in a group fails, then a spare is switched in. The data
reconstruction  consists  of  copying  the  data  onto  the  spare  from  the  other  working  disk.  For 
the  case  of  RAID-1,  we  substitute  D  =  1  and  C  =  1  in  Equations  1-5. 
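
As a worked instance of this substitution, take the group mean time to data loss in the form $\bigl(\mu + ((D + C)(1 + p) - 1)\lambda\bigr) / \bigl((D + C)\lambda(\mu(1 - p) + (D + C - 1)\lambda)\bigr)$ and set $D = 1$, $C = 1$. The perfect-coverage case $p = 1$ is added here only for illustration:

```latex
\mathrm{MTDL}_i \Big|_{D=1,\,C=1}
  = \frac{\mu + (1 + 2p)\lambda}{2\lambda\bigl(\mu(1 - p) + \lambda\bigr)},
\qquad
\mathrm{MTDL}_i \Big|_{D=1,\,C=1,\,p=1}
  = \frac{\mu + 3\lambda}{2\lambda^{2}} .
```

When repair is much faster than failure ($\mu \gg \lambda$) and coverage is perfect, the second expression is approximately $\mu / (2\lambda^2)$.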

2.2  RAID-2 

In this scheme, each group has D data disks and C (where $C \ge \log_2(D + C + 1)$) check
disks [5]. The check disks can correct single bit errors and detect double bit errors. A single
failure of any disk in a group is tolerable. An uncovered fault or failure of two or more disks
causes data loss. Substituting the appropriate values of D and C in Equations 1-5 yields
the reliability and MTDL for this organization for RAID.
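
Reading the condition on the number of check disks as $2^C \ge D + C + 1$ (an assumption about the intended inequality), the smallest admissible $C$ for a given $D$ can be found by direct search. The function name below is hypothetical.

```python
def min_check_disks(D):
    """Smallest C satisfying 2**C >= D + C + 1 for D data disks."""
    C = 1
    while 2**C < D + C + 1:
        C += 1
    return C


for D in (1, 4, 10, 25):
    print(D, min_check_disks(D))
```

Under this reading, a group of 10 data disks needs 4 check disks, so the redundancy overhead shrinks as the group grows.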

2.3  RAID-3,4,5 

The  C  check  disks  for  D  data  disks  in  RAID-2  are  basically  needed  to  detect  the  incorrect 
bit  position.  Once  the  incorrect  bit  has  been  identified,  then  a  single  parity  bit  suffices  for 
correction  (reconstruction)  of  data  which  would  otherwise  be  permanently  lost.  In  RAID 
levels  3,4,  and  5,  the  ability  of  the  disk  controller  to  detect  a  failed  disk  is  utilized.  Thus,  we 
need  only  one  check  disk  per  group  since  the  disk  controller  identifies  the  failed  bit  position. 
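
The role of the single check disk can be illustrated with bytewise XOR parity. This is a minimal sketch under the assumption that the check disk stores the XOR of the data disks and that the controller has already identified which disk failed; the block contents and function names are made up.

```python
from functools import reduce


def parity(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_blocks, parity_block):
    """Rebuild the one missing block from the survivors and the parity block."""
    return parity(list(surviving_blocks) + [parity_block])


data = [b"\x0f\xa0", b"\x33\x55", b"\xff\x00"]  # three hypothetical data disks
check = parity(data)                             # contents of the check disk
lost = 1                                         # index reported failed by the controller
survivors = data[:lost] + data[lost + 1:]
print(reconstruct(survivors, check) == data[lost])
```

Because XOR is its own inverse, the same routine that computes the parity also recovers the missing block.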

A RAID-3 architecture was proposed by Park and Balasubramaniam [10] and a RAID-4
architecture was proposed in [14]. RAID-4 differs from RAID-3 in that the data is interleaved
between disks at sector level. In RAID-3, data is interleaved at bit level. Thus, in
RAID-3, an I/O transfer is spread across all the disks within a group. Whereas RAID-3
allows only one I/O transfer per unit time per group, RAID-4 allows parallel transfers from
a group. However, only the reads are parallelized. Writes are limited to one per group at
a time since every write request results in a read and write to the parity disk. Therefore,
RAID-4 results in improved performance for reads.

To parallelize writes, RAID-5 architecture was proposed by Patterson et al [11]. In this
scheme, parity information is spread across all the disks within a group (rotated parity).
This scheme results in improved performance for reads as well as writes. However, the
reliability models for RAID-3, RAID-4, and RAID-5 are identical since they all require only
one check disk. Substituting C = 1 in Equations 1-5 yields the desired results for reliability
and MTDL.

3  Predictive  Disk  Failures 

RAID-3, 4, and 5 rely on the disk controller’s ability to correctly predict disk failures
before they occur. Some implementations use this property to prevent data loss and
reduce the reconstruction time. We modify our earlier model to account for these new
features. We assume that no loss of data occurs if the disk controller correctly predicts
an impending disk failure. We further assume that the spare is electronically switched in
and the data copied onto the spare before the failing disk is powered down. This sequence of
operations does not result in a change of state of the system. However, the disk controller
may not always be able to predict a disk failure. Failures resulting from uncovered faults
are not predictable. With probability (1 - $\alpha$), an impending failure due to a covered fault
is not predicted.

There is also the possibility of false alarms, in which the disk controller erroneously predicts
a disk failure. The time to the next false alarm is assumed to be exponentially distributed with
rate $\gamma$. False alarms are treated as correctly predicted failures and do not result
in a change in system state, because we assume an unlimited supply of disk spares.
They do, however, carry a monetary cost, since a false alarm consumes a disk spare
unnecessarily. This effect of false alarms becomes clear in a later section, where we consider a
finite number of spares. The Markov reliability model based on these assumptions for each group is
shown in Figure 3.

The  reliability  of  each  group  has  the  same  form  as  Equation  2  where 

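$$\beta_1 = \frac{-\bigl(((D+C)(2-p\alpha)-1)\lambda+\mu\bigr)+\sqrt{X}}{2}, \qquad
\beta_2 = \frac{-\bigl(((D+C)(2-p\alpha)-1)\lambda+\mu\bigr)-\sqrt{X}}{2},$$

$$A_1 = \frac{\beta_1+\mu+\bigl((D+C)(1+p(1-\alpha))-1\bigr)\lambda}{\beta_1-\beta_2}, \qquad
A_2 = \frac{\beta_2+\mu+\bigl((D+C)(1+p(1-\alpha))-1\bigr)\lambda}{\beta_2-\beta_1},$$

[Figure 3 edge labels: $(D+C)\lambda p(1-\alpha)$, $(D+C-1)\lambda$]

Figure 3: Markov reliability model of a group of disks with predictive failures (RAID-3,4,5)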

$$X = \bigl((D+C)((D+C)p^2\alpha^2-2p\alpha)+1\bigr)\lambda^2+\mu^2+2\bigl((D+C)p(2-\alpha)-1\bigr)\lambda\mu .$$


The mean time to data loss for a group of disks is given by:

$$\mathrm{MTDL} = \frac{\mu+\bigl((D+C)(1+p(1-\alpha))-1\bigr)\lambda}{(D+C)\lambda\bigl(\mu(1-p)+(D+C-1)(1-p\alpha)\lambda\bigr)} . \qquad (6)$$


The  reliability  block  diagram  for  the  disk  array  is  the  same  as  shown  in  Figure  1.  The 
reliability  of  the  disk  array  is  computed  by  Equation  1  and  MTDL  for  the  disk  array  is 
computed  as  in  Equation  5. 
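The group MTDL of Equation 6 can be cross-checked against the Markov chain it comes from. The sketch below is a minimal check under stated assumptions: the chain is read as having an operational state, a reconstruction state, and a data-loss state, with rates $(D+C)\lambda p(1-\alpha)$ (covered but unpredicted failure), $(D+C)\lambda(1-p)$ (uncovered failure), $(D+C-1)\lambda$ (second failure during reconstruction), and $\mu$ (reconstruction). The function names are illustrative, and the parameter values are the base values quoted in Section 6.

```python
def group_rates(d, c, lam, p, alpha):
    """Transition rates of the assumed three-state group model."""
    g = d + c
    a = g * lam * p * (1.0 - alpha)  # covered, unpredicted failure -> reconstruction
    u = g * lam * (1.0 - p)          # uncovered failure -> data loss
    b = (g - 1) * lam                # failure during reconstruction -> data loss
    return a, u, b


def mtdl_closed_form(d, c, lam, mu, p, alpha):
    """Group MTDL in the closed form of Equation 6."""
    g = d + c
    numerator = mu + (g * (1.0 + p * (1.0 - alpha)) - 1.0) * lam
    denominator = g * lam * (mu * (1.0 - p) + (g - 1) * (1.0 - p * alpha) * lam)
    return numerator / denominator


def mtdl_first_step(d, c, lam, mu, p, alpha):
    """Group MTDL from first-step analysis of the same chain.

    T0 = 1/(a+u) + a/(a+u) * T1   (mean time to loss from the operational state)
    T1 = 1/(mu+b) + mu/(mu+b) * T0   (mean time to loss from reconstruction)
    """
    a, u, b = group_rates(d, c, lam, p, alpha)
    exit_up, exit_rec = a + u, mu + b
    return (1.0 / exit_up + a / (exit_up * exit_rec)) / (1.0 - a * mu / (exit_up * exit_rec))


# Base RAID-3,4,5 group: 4 data disks, 1 check disk, hot spares.
params = dict(d=4, c=1, lam=1.0 / 40000, mu=1.0 / 2, p=0.9)
closed = mtdl_closed_form(alpha=0.9, **params)
solved = mtdl_first_step(alpha=0.9, **params)
without_prediction = mtdl_closed_form(alpha=0.0, **params)
```

The two routes agree to rounding error, and setting `alpha=0.0` gives the smaller MTDL expected when no failures are predicted.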


4  Cold  Disk  Spares  Versus  Hot  Disk  Spares 

In the earlier sections, we assumed an unlimited supply of disk spares that did not fail. In
reality, however, a fixed number of spares is maintained. The disk spares could be maintained
hot or cold. A hot disk spare can fail even though it is not in active use. A cold
disk spare does not fail unless it is switched in as a replacement for a failed disk. A hot
spare can be switched in electronically after a disk fails and the time to perform the switch-in
is negligible. Hence the effective data reconstruction time in this case consists only of
constructing data on the spare disk. The disadvantages of hot disk spares are: 1) the
automated switch-in mechanism adds to the cost overhead, 2) spares can fail while not in
active use, and 3) the hardware used to carry out spare switch-in may fail (these failures
can be accounted for by the coverage probability).

Figure 4: Reliability model of a group of disks with M cold spares (RAID-1,2)

The other option is to maintain cold disk spares. When a disk fails, typically a repairperson
is called to install a spare. After installation, the disk reconfiguration and data
reconstruction begin. Thus the total repair time increases. A better solution is perhaps
a combination of the two approaches. A few hot spares could be maintained, while the rest are
kept cold. Each time a disk fails, a hot spare is used up and a cold spare is made hot.

If a disk fails after all the spares are exhausted, then a new disk is ordered from the
manufacturer. This increases the data reconstruction time. In practice, this could be
avoided by always maintaining a minimum number of spares. Each time the number of
available spares falls below this minimum, new disks can be ordered from the manufacturer.
This scenario closely approximates the case of unlimited spares. Maintaining spares (hot or
cold) is a cost overhead. Some users may prefer not to maintain any spares. This strategy
could result in savings depending upon several factors, including loss of revenue while data
reconstruction takes place and the reliability and availability desired. If disk spares are
maintained, then the question arises as to how many spares should be kept. Intuitively, it
is clear that if there are a small number of groups each containing a large number of disks,
then the number of spares per group should be large. However, if there are a large number of
groups each containing a small number of disks, then the number of spares per group should
be small.

Figure 5: Reliability model of a group of disks with M cold spares and predictive failures
(RAID-3,4,5)

In this section, we develop reliability models for different RAID architectures based
upon different assumptions. Assume that each group has M spare disks. Assume that
for hot disk spares, the time to switch in a spare is negligible. Let us first consider the
reliability model of a group of disks with M cold spares for RAID-1,2 (Figure 4). A state is
a two-tuple (i, j) where i is the number of active disks (operational data and check disks)
and j is the number of disk spares left. State A is the group failed state. In this model,
G = D + C is the number of active disks, $\lambda_1 = (D+C)\lambda_d p$ where $\lambda_d$ is the failure rate of a
disk in active use, $\lambda_4 = (D+C-1)\lambda_d$, and $\lambda_3 = (D+C)\lambda_d(1-p)$. The transitions with rate
$\lambda_4$ represent failure of a disk during data-reconstruction, which results in
data loss. Transitions with rate $\lambda_3$ represent uncovered failures. Transitions with rate $\lambda_1$
represent covered failure of a disk. The rate of data-reconstruction is $\mu_3$ as long as at least one
disk spare is available. In state (G, 0), all the spares are exhausted. If a disk fails in this
state, the data-reconstruction rate is $\mu_2$, where $\mu_2 < \mu_3$ because the mean data-reconstruction
time increases.
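The dependence of group MTDL on the number of cold spares can be explored with a short recursion over the spare count. The sketch below assumes one plausible reading of the cold-spare chain, since the figure itself is not reproduced here: from (G, j) a covered failure leads to (G-1, j) and an uncovered failure to data loss; from (G-1, j) a further failure leads to data loss, while reconstruction returns to (G, j-1) at rate $\mu_3$ when j > 0, or to (G, 0) at rate $\mu_2$ when no spare is left. The function name and the chain structure are assumptions; the rates are those defined in the text and the values are taken from Section 6.

```python
def mtdl_cold_spares(m_spares, d, c, lam_d, p, mu2, mu3):
    """Mean time to data loss of one group starting with m_spares cold spares."""
    g = d + c
    lam1 = g * lam_d * p            # covered failure of an active disk
    lam3 = g * lam_d * (1.0 - p)    # uncovered failure -> data loss
    lam4 = (g - 1) * lam_d          # failure during reconstruction -> data loss
    exit_up = lam1 + lam3

    # No spares left: two-state cycle (G,0) <-> (G-1,0) with the slow rate mu2.
    exit_rec0 = mu2 + lam4
    t_up = (1.0 / exit_up + lam1 / (exit_up * exit_rec0)) / (
        1.0 - lam1 * mu2 / (exit_up * exit_rec0)
    )

    # Each additional spare adds one failure/reconstruction stage at rate mu3.
    exit_rec = mu3 + lam4
    for _ in range(m_spares):
        t_rec = 1.0 / exit_rec + (mu3 / exit_rec) * t_up
        t_up = 1.0 / exit_up + (lam1 / exit_up) * t_rec
    return t_up


mtdl_by_spares = [
    mtdl_cold_spares(m, d=4, c=1, lam_d=1.0 / 40000, p=0.9, mu2=1.0 / 74, mu3=1.0 / 50)
    for m in range(4)
]
```

Under these assumptions the MTDL grows with each added spare but only slowly, approaching the value obtained when a spare is always available.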

In the case of RAID-3,4,5 with cold spares, the reliability model is shown in Figure 5. Here
$\lambda_2 = (D+C)\lambda_d p\alpha + \gamma$ and $\lambda_1 = (D+C)\lambda_d p(1-\alpha)$, where $\alpha$ and $\gamma$ are the same as defined
in Section 3. Assume that the data-reconstruction time for correctly predicted failures and false
alarms is negligible. A spare is installed before the failing disk is powered down. Thus, these
transitions result only in a state change reflecting the decrease in the number of available
spares by one. The rates $\lambda_3$ and $\lambda_4$ and the data-reconstruction rates are the same as in the
previous model.

Let us now consider the reliability model for RAID-3,4,5 with hot spares (Figure 6). This
model differs from the earlier model in that there are transitions from states (G - 1, M - i + 1)
to states (G - 1, M - i) (i = 1, ..., M) signifying the failure of hot disk spares. This also
changes the transition rates from states (G, M - i + 1) to (G, M - i) (where i = 1, ..., M). Rate
$\lambda_{s,i} = i\lambda_{sp} + (D+C)\lambda_d p\alpha + \gamma$, where $\lambda_{sp}$ is the failure rate of a hot disk spare and $\lambda_{sp} < \lambda_d$.
The data-reconstruction rate while at least one spare is available is $\mu_1$, and $\mu_2 < \mu_1$. In the case of
RAID-1,2 with hot spares, the reliability model remains the same but the transition rates
change. In particular, $\lambda_1 = (D+C)\lambda_d p$ and $\lambda_{s,i} = i\lambda_{sp}$.

5  Reliability  Model  of  RAID  with  Support  Hardware 

A disk array system has many hardware components that are needed for proper functioning
of the disk array. These include the host bus adaptor (HBA), disk array controller (DC), hard
disk drive (HDD) controller, single board controller (SBC) (the track buffer and error correction
circuitry (ECC) are resident in the SBC), cooling hardware, and power supply. So far we
have considered only the disks in the reliability models of disk arrays (although the coverage
probability accounted for failure of some support hardware components). Schulze et al. [15] have
shown that failures of the support hardware considerably reduce the overall reliability of
a disk array. In fact, failure of some of the support hardware components may result in
data loss. For instance, failure of cooling equipment may result in data corruption in disks
and failure of the power supply may result in data loss. In this section, we model the effect of
the failure-repair behavior of support hardware components on the reliability of the disk array.
We consider two hardware organizational schemes.

Figure 6: Reliability model of a group of disks with M hot spares

5.1  Serial  Placement  of  Support  Hardware 

This is a simple organizational scheme in which the disk array has a set of associated hardware
components (HBA, power supply (PS), cooling fans (CF), HDD, SBC, disk controller,
etc.). These components are placed serially with the disk array. If there is no redundancy
in these components, then failure of any of these components results in the failure of the
disk array. If there is redundancy, then failures of some of these components can be tolerated
and failed components may be repaired or replaced. However, the placement is serial
and if any component were to stop functioning despite the redundancy, the disk array is
considered failed. The two-stage hierarchical reliability model is shown in Figure 7.

Figure 7: RAID organization with serial placement of support hardware

The RBD is a simple extension of the RBD shown in Figure 1. The reliability of the disk array is
now given by:

$$R_{da}(t) = \Bigl(\prod_{i=1}^{N} R_i(t)\Bigr)\, R_{hba}(t)\, R_{hdd}(t)\, R_{sbc}(t)\, R_{ps}(t)\, R_{cf}(t) . \qquad (7)$$

Depending upon the number of redundant spares of each component, the reliability
model for each component is different. We show Markov reliability models for components
with no redundancy, dual redundancy, and triple redundancy. These models are easily
extended to higher levels of redundancy. In each of these models, $\lambda$ is the failure rate of the
component, $\mu$ is the repair rate, and c is the coverage probability. We assume hot spares
for hardware components, which fail at the same rate $\lambda$.
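To give a feel for what redundancy and coverage buy for a single support component, the sketch below compares the MTTF of a component with no redundancy against a dual-redundant one. The dual-redundant chain is an assumed structure, since the component figures are not reproduced here: with both units up, a failure occurs at rate $2\lambda$ and is covered with probability c; a covered failure leaves one unit running while the other is repaired at rate $\mu$; an uncovered failure, or failure of the last unit, fails the component. The function names are illustrative, and the power supply MTTF and repair time are the values quoted in Section 6.

```python
def mttf_no_redundancy(lam):
    """A single unit with exponential lifetime: MTTF = 1 / lambda."""
    return 1.0 / lam


def mttf_dual_redundancy(lam, mu, c):
    """MTTF of an assumed dual hot-redundant component with repair and coverage."""
    a = 2.0 * lam * c          # covered failure: two units up -> one unit up
    u = 2.0 * lam * (1.0 - c)  # uncovered failure: component fails outright
    b = lam                    # the remaining unit fails before repair completes
    # Mean time to absorption from the state with both units up.
    return (mu + b + a) / ((a + u) * b + u * mu)


lam_ps = 1.0 / 1460   # power supply failure rate, per hour
mu_repair = 1.0 / 24  # repair rate, per hour
single = mttf_no_redundancy(lam_ps)
dual = mttf_dual_redundancy(lam_ps, mu_repair, c=0.9)
dual_no_repair = mttf_dual_redundancy(lam_ps, 0.0, c=1.0)
```

With perfect coverage and no repair the dual-redundant MTTF reduces to the familiar $1.5/\lambda$, which is a convenient sanity check on the assumed chain.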

5.2 Orthogonal Placement of Support Hardware

Schulze et al. [15] proposed an organizational scheme for placement of disks and the support
hardware in a way that makes disk arrays more fault-tolerant. The disks are organized in
a two-dimensional grid with each row representing a parity group of disks. In RAID, each
parity group can tolerate a single disk failure. Support hardware (power supply, cooling
fans, HBA, etc.) is provided for each column of disks. Thus, each column forms a support
hardware group. This organization is shown in Figure 8. This orthogonal placement of
parity groups against support hardware groups provides fault tolerance against failure of
support hardware components. The disk array remains operational even if all the disks in a column
group or any support hardware components along a column fail. However, a disk array organized
in this way is more expensive than the serial organization because of the large number of
associated hardware components.

In [15], an approximate estimate for the MTTF of a disk array organized in this manner
is provided. Due to the complex dependence of failure and data-reconstruction in this RAID
organization, a simple reliability model cannot be developed. We developed a reliability
model using stochastic Petri nets, but it resulted in a very large Markov chain. The symmetry
in this model prompted us to develop a smaller approximate model. The approximate model
is obtained by essentially lumping identical states into one state. The approximate Markov
model has only four states and yielded solutions that were close to the solutions obtained
using the exact model.

The approximate model is shown in Figure 9. State 2 is the fully operational state of the
disk array with no disk or hardware component failed. Assume that all the support hardware
components are statistically independent and identical across different columns. Failure of
any support hardware component in a column causes the entire column to fail. Assuming an
exponential time-to-failure distribution for each support hardware component, the
failure rate of each hardware column, $\lambda_{sh}$, is the sum of the failure rates of its components (CF,
PS, HBA, DC, etc.), and the time-to-failure distribution of each hardware column is exponential
with this rate [16]. In state 2, one of the hardware columns may fail (transition to state
1) and it is repaired at rate $\mu_{sh}$. For lack of real data, we assume that each hardware
component has the same MTTR. If this is not the case, then simple extensions to the
reliability model can be made by introducing different failure states for failures of different
hardware components and their repair.

Figure 8: RAID organization with orthogonal placement of support hardware

Figure 9: Approximate reliability model for orthogonal RAID-1,2,3,4,5
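The rule that a column's failure rate is the sum of its component failure rates is easy to evaluate. The sketch below uses the component MTTF values quoted in Section 6; which of these components actually make up one column is an assumption made here for illustration, and the names `component_mttf_hours` and `column_failure_rate` are not from the source.

```python
# Component MTTFs in hours, as quoted in Section 6 (column membership is assumed).
component_mttf_hours = {
    "power supply": 1460.0,
    "HBA": 123000.0,
    "power cable": 10000000.0,
    "SCSI cable": 21000000.0,
    "cooling equipment": 195000.0,
    "SBC": 40000.0,
    "HDD controller": 30000.0,
}


def column_failure_rate(mttfs):
    """Sum of exponential failure rates for independent components in series."""
    return sum(1.0 / m for m in mttfs)


lam_sh = column_failure_rate(component_mttf_hours.values())
column_mttf = 1.0 / lam_sh

# The component with the smallest MTTF contributes the largest share of the rate.
dominant = min(component_mttf_hours, key=component_mttf_hours.get)
```

With these inputs the column MTTF falls a little below the MTTF of its least reliable component, which shows how strongly a single weak component can dominate a series arrangement.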

A failure in a hardware column is covered with probability $p_{sh}$. An uncovered failure
causes the disk array to fail. While the repair is underway, the disks in this column are
considered non-operational. However, if any of the remaining (N - 1)(D + C) disks fails or
if any of the other column hardware groups fails before the repair is completed, then data loss
occurs (transition to failed state 0). Similarly, in state 2, any of the N(D + C) disks may fail
($\lambda_d$ is the disk failure rate) (transition to state 1). A disk failure is covered with probability
$p_d$. An uncovered disk failure causes data loss. The data-reconstruction rate is $\mu_d$. If any of the
other disks in this group fails or if any of the other hardware column groups fails before the
data-reconstruction is completed, then data loss occurs.

6  Numerical  Results 

We conducted several experiments with the models described in previous sections. It is
hard to compare the different levels of RAID because it is not quite clear, for example, what
constitutes an equivalent RAID-1 organization given a RAID-5 organization. We designed
our models assuming a required storage capacity of the disk array (32 disks). We also
assume that each level of RAID uses identical disks (same capacity and same mean time to
failure), which presumably come from the same manufacturer. Given these assumptions, we
stress that the reader should focus on the characteristics of individual curves (each
curve corresponds to a RAID level) and not on comparing different curves on the same plot.
For instance, given a RAID-5 with 16 groups of 16 data disks each, its equivalent RAID-1
would have 256 data disks and 256 mirrored disks of the same capacity. This is exorbitantly
redundant. A system designer would rather implement a RAID-1 architecture with 8 data
disks of larger capacity.

The base model we use for RAID-1 has 32 data disks and 32 mirrored disks. For RAID-2,
the model has 8 groups each consisting of 4 data disks and 3 check disks. For RAID-3,
4, and 5, the base model has 8 groups of 4 data disks and 1 check disk. The numerical
values of some of the model parameters are chosen based upon the data given in [11, 15].
The time to failure of an active disk (data and check disks) is exponentially distributed
with mean 40000 hours ($\lambda$ = 1/40000 per hour). We assume the distribution of time to
data reconstruction is exponential with rate $\mu$. If hot disk spares are maintained, the mean
data-reconstruction time is 2 hours ($\mu$ = 1/2 per hour). If cold disk spares are maintained,
then the mean data-reconstruction time is 50 hours. If no disk spares are maintained, then
the mean data-reconstruction time is 74 hours. The failure rate of a hot spare disk is
$\lambda_{sp}$ = 1/50000 per hour. The coverage probability in each case is assumed to be 0.9.

In models with predictive disk failures, the rate of false alarms is chosen to be $\gamma$ = 1/100000
per hour and the probability that an impending failure is correctly predicted is chosen to be
$\alpha$ = 0.9. For models with support hardware components, we have from the data provided in
[15]: MTTF for power supply = 1460 hours, HBA = 123000 hours, power cable = 10000000
hours, SCSI cable = 21000000 hours, cooling equipment = 195000 hours, SBC = 40000 hours,
and HDD controller = 30000 hours. We take the mean repair time for any support hardware
component (also the MTTR for any support hardware column) to be 24 hours ($\mu_{sh}$ = 1/24 per
hour).

Figure 10: Reliability v/s MTTF disk in hours

All  these  two-level  hierarchical  Markov  models  were  solved  using  the  software  package 
SHARPE  [13]. 

6.1  How  reliable  should  each  disk  be? 

The two-level hierarchical model composed of submodels shown in Figures 1 and 2 (RAID-1,2)
or 3 (RAID-3,4,5) was solved. Assume that hot disk spares are maintained in each
case and a spare is available each time a disk fails. Figure 10 shows how the reliability
(evaluated at t = 1000 hours) of various disk array architectures varies as the MTTF of a single
disk increases. The reliability gain is significant as the MTTF of each disk is increased from
10000 hours to 40000 hours. However, the gain in reliability is small as the MTTF of a disk
is increased beyond 40000 hours.

6.2  Are  RAID  architectures  reliable  enough  for  mission-critical  systems? 

Once again, the same two-level hierarchical models as in the earlier experiment are solved. In
Figure 11, the reliability of the disk array is plotted as a function of mission time. Disk arrays
are highly reliable for operation periods of 500 hours or less. However, for mission-critical
systems with long mission times, we see that different disk array architectures with the
given amount of redundancy are not very reliable. Note that in these models, we have not
taken into account the reliability of support hardware, which decreases the reliability of the
disk array.

Figure 11: Reliability v/s Time in hours

6.3 How much does improved coverage of disk failures help?

We solved the two-level hierarchical model composed of submodels shown in Figures 1 and 2
(RAID-2) or 3 (RAID-3,4,5). In Figure 12, the reliability of the disk array (at t = 1000 hours)
is plotted as a function of coverage probability. Tremendous improvement in reliability is
achieved with improved coverage of disk failures. Although it is impossible to get rid of
catastrophic failures, it is possible to improve the reliability of support hardware, ECC,
and disk controllers by introducing redundancy. It should be noted that the coverage probability
may well depend upon the RAID architecture. It can be argued that RAID-1 will
have a higher coverage probability than RAID-3,4,5, because the error detection and correction
strategy of RAID-3,4,5 is more prone to uncovered failures.

Figure 12: Reliability v/s Coverage

6.4  How  low  should  the  data-reconstruction  time  be? 

Let us now consider the effect of mean time to data-reconstruction (MTDR) on disk array
reliability. The two-level hierarchical model composed of submodels shown in Figures 1
and 2 (RAID-1,2) or 3 (RAID-3,4,5) was solved. The plots in Figure 13 show that
data-reconstruction time does not significantly affect the reliability of the disk system. Reliability
is evaluated at time t = 1000 hours. Varying MTDR from 2 hours to 100 hours did
not change disk array reliability much. This is because the MTTF of a
disk is much larger than the MTDR of a disk. Thus, taking
expensive measures to reduce the data-reconstruction time would not yield significant gains.
The virtual independence of array reliability from data-reconstruction time suggests the use
of cold disk spares over hot disk spares. Hot spares are prone to failure just like data disks,
but cold spares do not fail. Besides, it is more expensive to maintain hot spares than cold
spares. However, depending upon the application, it may be useful for a variety of reasons
(loss of revenue, user requirements, etc.) to minimize the data-reconstruction time.
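The weak dependence on reconstruction time can be reproduced for a single group from the two-exponential form of the reliability function. The sketch below is a minimal illustration, assuming the three-state group chain of Section 3 (covered unpredicted failure, uncovered failure, failure during reconstruction) and the base parameters of Section 6; the function name and the use of one RAID-3,4,5 group rather than a whole array are choices made here, not the paper's experiment.

```python
import math


def group_reliability(t, d, c, lam, mu, p, alpha):
    """R(t) = A1*exp(beta1*t) + A2*exp(beta2*t) for the assumed group chain."""
    g = d + c
    a = g * lam * p * (1.0 - alpha)  # covered, unpredicted failure
    u = g * lam * (1.0 - p)          # uncovered failure -> data loss
    b = (g - 1) * lam                # failure during reconstruction -> data loss
    s = a + u + b + mu
    x = s * s - 4.0 * ((a + u) * b + u * mu)  # discriminant, always non-negative
    root = math.sqrt(x)
    beta1 = (-s + root) / 2.0
    beta2 = (-s - root) / 2.0
    n = mu + b + a
    a1 = (beta1 + n) / (beta1 - beta2)
    a2 = (beta2 + n) / (beta2 - beta1)
    return a1 * math.exp(beta1 * t) + a2 * math.exp(beta2 * t)


base = dict(d=4, c=1, lam=1.0 / 40000, p=0.9, alpha=0.9)
r_fast = group_reliability(1000.0, mu=1.0 / 2, **base)    # MTDR = 2 hours
r_slow = group_reliability(1000.0, mu=1.0 / 100, **base)  # MTDR = 100 hours
r_start = group_reliability(0.0, mu=1.0 / 2, **base)
```

Under these assumptions the two reliabilities at t = 1000 hours differ only slightly, because uncovered failures rather than failures during reconstruction dominate the loss rate when the disk MTTF is much larger than the MTDR.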

Figure 13: Reliability v/s Mean time to data-reconstruction

6.5 How many disk spares are needed?

In the previous models, we assumed that an unlimited number of hot disk spares was maintained.
In Section 4, we saw that with some data-reconstruction cost overhead, it is possible to
ensure approximately permanent availability of a disk spare. In this section, we analyze
the dependence of array reliability on the number of spares and the kind of spares (hot
or cold). We solved the two-level hierarchical model composed of the models shown in Figures
1 and 4 (RAID-1,2 with cold disk spares) or 5 (RAID-3,4,5 with cold disk spares) or 6
(RAID-1,2,3,4,5 with hot disk spares) to analyze the dependence of MTDL on the number
of spares.

In these models, we choose $\mu_3$ = 1/50.0 per hour (data-reconstruction rate when a
cold spare is available), $\mu_2$ = 1/74.0 per hour (data-reconstruction rate when no spares are
available), and $\mu_1$ = 1/2.0 per hour (data-reconstruction rate when a hot spare is available
on site). The MTTF of a hot disk spare is taken to be 50000.0 hours ($\lambda_{sp}$ = 1/50000.0 per
hour), which is larger than the MTTF of an active disk (= 40000.0 hours).

For RAID-3,4,5, the MTDL of the disk array as the number of spares (cold and hot)
is increased from zero to three is plotted in Figure 14. The relative gain in MTDL is
small, supporting the results of Section 6.4. The absolute gain in MTDL is also small as
the number of spares is increased beyond two per group. Similar trends are observed for
RAID-1 and RAID-2.

Figure 14: MTDL of RAID-3,4,5 v/s Number of spares

6.6  Is  RAID  reliability  scalable? 

A natural step towards more parallelism in I/O transfers would be to scale the disk arrays
in two dimensions. One is to increase the number of disks in a group and the other is to
increase the number of groups (or both). The other reason to scale disk arrays could be
simply an increased demand on storage capacity. We wish to find out whether the reliability of
the disk arrays scales appropriately. The RAID-1 architecture is obviously not scalable.
Increasing the number of disks reduces array reliability and MTDL. Moreover, it is not cost
effective to duplicate all the disks in the system if we intend to use over 100 small disks.
Thus, for the RAID-1 architecture, the number of disks should be kept small and the size of
each disk should be increased. The important thing to remember is that the MTTF of a large disk is
not significantly lower than the MTTF of a small disk. Thus, the above solution yields
a more reliable design of RAID-1.

Given  the  storage  capacity  for  RAID-2,3,4,  and  5,  the  choice  of  number  of  data  disks  in 
a  group  and  number  of  groups  is  dictated  mainly  by  performance  considerations  (e.g.,  the 
amount  of  parallelism  in  I/O  transfers)  and  support  hardware  available  (e.g.,  the  number  of 
I/O  channels,  array  controllers  etc.).  We  illustrate  how  these  choices  affect  array  reliability. 
The  two-level  hierarchical  model  solved  in  this  case  is  composed  of  submodels  shown  in 
Figures  1  and  2  (RAID-2)  or  3  (RAID-3,4,5). 

Given  a  fixed  number  of  disks  in  a  group  (D  =  8),  the  number  of  groups  is  varied. 
Figure  15  illustrates  how  reliability  decreases  as  N  increases.  Next,  given  a  fixed  number 
of  groups  (N  =  8),  the  number  of  disks  in  a  group  (D)  is  varied.  Figure  16  shows  how 
reliability  decreases  with  increase  in  D.  Reliability  in  both  the  cases  is  evaluated  at  time  t 
=  1000  hours.  These  plots  reveal  that  reliability  of  all  the  RAID  architectures  falls  below 
acceptable  levels  as  the  disk  arrays  are  scaled  up  in  dimensions.  A  simple  solution  is  to 
scale  the  redundancy  in  hardware  as  the  dimensions  of  a  disk  array  are  scaled.  Another 
solution  would  be  to  come  up  with  newer  designs  of  RAID  with  more  fault-tolerance  like 
the  one  suggested  in  [2]. 
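The scaling trend can be sketched by integrating the group model numerically and combining groups in series. The example below assumes that the array reliability is the product of the group reliabilities (the product term of Equation 7), uses the three-state RAID-3,4,5 group chain of Section 3 with the Section 6 base parameters, and solves the state equations with a fixed-step fourth-order Runge-Kutta scheme. The function name, step count, and the grids of N and D values are illustrative choices, not the paper's experiment.

```python
def group_reliability_rk4(t_end, d, c, lam, mu, p, alpha, steps=4000):
    """Integrate the two transient state probabilities and return R(t_end)."""
    g = d + c
    a = g * lam * p * (1.0 - alpha)  # operational -> reconstruction
    u = g * lam * (1.0 - p)          # operational -> data loss
    b = (g - 1) * lam                # reconstruction -> data loss

    def deriv(p0, p1):
        return (-(a + u) * p0 + mu * p1, a * p0 - (mu + b) * p1)

    h = t_end / steps
    p0, p1 = 1.0, 0.0  # start fully operational
    for _ in range(steps):
        k1 = deriv(p0, p1)
        k2 = deriv(p0 + h / 2 * k1[0], p1 + h / 2 * k1[1])
        k3 = deriv(p0 + h / 2 * k2[0], p1 + h / 2 * k2[1])
        k4 = deriv(p0 + h * k3[0], p1 + h * k3[1])
        p0 += h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        p1 += h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    return p0 + p1


common = dict(c=1, lam=1.0 / 40000, mu=1.0 / 2, p=0.9, alpha=0.9)

# Fixed group size (D = 8), varying number of groups N.
r_group_d8 = group_reliability_rk4(1000.0, d=8, **common)
by_groups = {n: r_group_d8 ** n for n in (2, 4, 8, 16)}

# Fixed number of groups (N = 8), varying group size D.
by_disks = {d: group_reliability_rk4(1000.0, d=d, **common) ** 8 for d in (4, 8, 16)}
```

Under these assumptions the array reliability at t = 1000 hours falls steadily as either dimension grows, in line with the trend described for Figures 15 and 16.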

6.7  How  much  do  we  gain  by  orthogonal  placement  of  support  hardware? 

We now analyze the gains in reliability due to orthogonal placement of support hardware
components over the serial placement. Assume that there is no hardware redundancy in
either organization. For the serial organization, the two-level hierarchical model composed of
submodels shown in Figures 7 and 2 (RAID-1,2) or 3 (RAID-3,4,5) is solved. For the orthogonal
organization, the model shown in Figure 9 is solved. The array reliability is plotted as a
function of time in Figure 17.

Comparing the two curves (e.g., RAID-1 (srl) and RAID-1 (otg)), we find that orthogonal
placement of support hardware improves array reliability. A key pattern to note is that for
the orthogonal organization, the RAID-1 architecture has higher reliability than RAID-3,4,5 or
RAID-2 (as opposed to all the earlier plots). This happens because there are only two
hardware columns in RAID-1. Thus, RAID-1 benefits the most from orthogonal placement
as far as improvement in reliability is concerned. Moreover, the overhead cost due to
multiple support hardware components (one for each column) of RAID-1 is the least among
the different RAID organizations. In serial placement of hardware, the reliabilities of different
RAID architectures are almost the same. This happens because of the very small MTTF
of support hardware (≈ 1460 hours) compared with the MTTF of disks.

Figure 17: Serial RAID reliability as a function of time (hours); curves for RAID-1, RAID-2,
and RAID-3,4,5

7  Conclusion 

We have carried out reliability analysis of different fault-tolerant disk array architectures
classified as different levels of RAID. The reliability models formally capture the operational
dependency of the disk array system on array organization and support hardware. Solution
of these models provides useful insight into the dependence of disk array reliability on
parameters such as MTTF of disks, mean data-reconstruction time, coverage of faults, and
dimensions of disk arrays.

If the disk array is intended for a mission-critical application with long mission times,
then it must possess high reliability over a long period of time. Our results show that
none of the RAID architectures meets the ultra-high reliability requirements in its present
implementation. However, introduction of additional redundancy for key components may
help achieve the desired reliability. If the MTTF of a single disk is increased beyond a certain
value, the gains in disk array reliability are not significant. Thus, ultra-reliable individual
disks are not the solution for ultra-reliable disk arrays. Reducing data-reconstruction time
does not yield a significant array reliability gain either. However, tremendous improvement
in disk array reliability can be obtained with improved coverage of faults (i.e., improved
fault detection and more reliable support hardware). Thus, the key to improving disk array
reliability is superior fault coverage. Dimensional scaling of disk arrays results in reliability
degradation. Therefore, hardware redundancy must be scaled as the dimensions of disk
arrays are scaled to maintain high reliability.

If the reliability of support hardware (power supply, cooling hardware, array controller,
host bus adaptor, etc.) is taken into account, then the overall reliability of the disk array decreases.
Orthogonal placement of support hardware increases the overhead cost but significantly
improves the reliability. The power supply is the bottleneck of support hardware reliability
and therefore of disk array reliability. The best gains in reliability are achieved if repair
times for support hardware are reduced. One way of achieving this is to maintain spares
for each component.

We expect these models and results to be useful to designers of disk array architectures
during  the  design  as  well  as  operational  stages  since  these  models  reveal  the  bottlenecks 
in  array  reliability.  Appropriate  steps  could  be  taken  either  by  modifying  the  array  design 
during  the  design  stages  or  by  introducing  additional  hardware  redundancy  if  the  system 
is  already  in  operation.  Given  specified  reliability  requirements,  these  models  also  test 
whether  a  given  disk  array  architecture  meets  those  specifications  or  not. 

RAPID PROTOTYPING

INTRODUCTION

Navy  systems  are  becoming  increasingly  complex.  Development 
of  these  systems  will  require  changing  the  methods  by  which  these 
systems  are  designed.  This  change  should  include  the  integration 
of  prototyping  into  the  entire  design  cycle.  Currently, 
prototyping  is  often  delayed  until  many  critical  design  decisions 
have  been  constrained.  This  leads  to  inferior  systems  because 
many  of  these  design  decisions  were  not  optimal.  Through  the 
integration  of  prototyping  into  the  design  cycle,  the  system 
designer  has  increased  capabilities  to  check  the  feasibility  of 
specifications  and  requirements,  compare  the  performance  of 
alternate  design  choices  through  trade-off  analysis  of  hardware, 
software,  or  humanware  implementations,  or  to  consider  the 
performance  of  different  algorithms.  Rapid  prototyping  is 
necessary  for  prototyping  to  be  a  practical  aid  in  the  design 
cycle. 


PROTOTYPING  USEFULNESS 

There  are  two  major  reasons  for  prototype  development:  to 
demonstrate  proof-of-concept  for  high  risk  sections  of  a  system 
and  to  assist  in  the  development  of  the  final  product. 

Successful  integration  of  prototyping  into  the  system  design 
process  requires  that  system  designers  and  management  understand 
the  differences  between  these  two  purposes. 

The  proof-of-concept  prototype  is  usually  classified  as  a 
"throw-away"  prototype.  It  is  developed  to  answer  questions 
about  one  particular  high-risk  section  of  a  project.  In  general, 
as  shown  in  figure  1,  the  greater  the  risk  associated  with  the 
prototype,  the  more  likely  the  prototype  will  be  thrown  away. 

This  prototype  aids  the  developers  either  in  determining  the 
feasibility  of  an  approach  or  by  allowing  exploration  of  the  best 
method  by  which  to  solve  a  problem.  Proof-of-concept  prototyping 
must be done early in the design cycle because the answers it
provides will usually drastically shape the final system. Also,
the  proof-of-concept  prototype  can  show  the  proposed  system  is 
not  feasible  and  thus  signify  that  a  major  review  of  the  system 
concept  is  required.  Two  dangers  affect  the  development  of  the 
proof-of-concept  prototype.  First,  management  may  oppose  the 
expenditure  of  resources  to  develop  a  prototype  which  will  be 
thrown  away.  This  view  deprives  the  developers  of  the  knowledge 
and  experience  gained  through  proof-of-concept  development. 
Secondly,  management  and  designers  may  be  tempted  to  incorporate 
the prototype into the final product. However, proof-of-concept
prototypes are generally not designed for maintainability,
fault-tolerance, reliability, or security.

Figure 1. Characterizing Prototyping Processes. A Dimension: Degree of
Experimental Intent. (Proceedings: Spring 1992 Prototech Community Meeting)

Those  prototypes  which  aid  in  the  development  of  the  final 
product  are  usually  classified  as  "evolutionary"  because  they 
often  gradually  evolve  into  the  product.  Early  in  the  design 
cycle,  they  can  clarify  requirements,  determine  the  sufficiency 
of  the  requirements,  and  facilitate  communication  between  system 
designers  and  management.  Later  in  the  design  cycle,  they  can 
allow  trade-off  analysis  between  different  design  decisions. 
Finally,  once  portions  of  the  system  are  implemented,  these 
portions can often be integrated into the prototype to allow
better  performance  measurement  and  prediction.  Communication, 
trade-off  analysis,  and  performance  measurements  are  some  of  the 
major strengths of evolutionary prototyping. Caution must be
used, however, to ensure that the system is designed for
maintainability, fault-tolerance, and reliability.


PROTOTYPING  APPROACHES 

To  meet  the  purposes  of  proof-of-concept  and  product 
development  prototypes,  several  different  approaches  have 
emerged. Examples of products are described to illustrate the
different prototyping approaches. For certain approaches, many
similar products exist beyond those described.

Simulation - Simulation uses software to model components of the
system. These components can be hardware, software, or
humanware. Simulation provides an ability to measure
performance, perform rapid trade-off analysis, and integrate
actual pieces of software as they are developed. Disadvantages
of simulation are that large development efforts can be required,
fairly detailed designs can be required, and actual hardware
usually cannot be used.

IDAS  -  Integrated  Design  Automation  System,  JRS  Research 
Laboratories,  Inc.  IDAS  provides  the  tools  necessary  to  map  a 
particular program onto a given architecture, to measure the
quality  of  that  mapping,  and  to  detect  where  and  under  what 
conditions  that  mapping  fails  to  be  optimum.  IDAS  provides  quick 
and  efficient  evaluation  of  design  alternatives,  execution  of 
desired  benchmarks  on  simulators  customized  for  specific  machine 
designs,  and  allows  the  designer  to  iteratively  improve  that 
design  until  the  desired  level  of  near  optimum  solution  has  been 
reached.  From  this  architecture,  trade-off  analysis  can  be 
performed to determine an architecture which satisfies
non-functional constraints such as size, cost, weight, and power.

SES  Workbench  -  This  product  allows  a  system  to  be  modeled 
as  a  collection  of  components  such  as  processors,  resources,  and 
delays.  From  this  system,  information  such  as  timing,  resource 
contention,  and  program  design  can  be  obtained.  It  provides  the 
ability  to  perform  rapid  trade-offs  in  components  and  in  their 
configuration.
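
As a rough illustration of the kind of model such simulation tools work with, the sketch below treats a system as jobs contending for a pool of identical processors and reports queueing delay. It is not the interface of IDAS or SES Workbench; the function name, the first-come, first-served policy, and the workload are assumptions made for this example.

```python
import heapq

def simulate(jobs, n_processors=1):
    """Simulate jobs contending for identical processors, first come first served.

    `jobs` is a list of (arrival_time, service_time) pairs. Returns the
    per-job waiting times (time queued before service starts), ordered by
    arrival time. All names here are hypothetical.
    """
    free_at = [0.0] * n_processors   # min-heap: time at which each processor frees up
    heapq.heapify(free_at)
    waits = []
    for arrival, service in sorted(jobs):
        earliest = heapq.heappop(free_at)    # next processor to become free
        start = max(arrival, earliest)       # the job waits if that processor is busy
        waits.append(start - arrival)
        heapq.heappush(free_at, start + service)
    return waits


workload = [(0.0, 4.0), (1.0, 4.0), (2.0, 4.0), (3.0, 4.0)]
for n in (1, 2):
    waits = simulate(workload, n_processors=n)
    print(f"{n} processor(s): mean wait = {sum(waits) / len(waits):.2f}")
```

Changing `n_processors` or the workload and rerunning is a small-scale version of the rapid trade-off in components and configuration described above.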

Prototyping  languages  -  These  languages  provide  the  designer  with 
the  ability  to  work  at  a  higher  level  of  abstraction  than 
previously  available  to  describe  their  domain.  Much  research  is 
being  done  to  develop  the  higher  level  prototyping  languages. 

The hardware description languages are well defined and in use,
whereas Proteus and other high-level prototyping languages are
still in the experimental stage.

VHDL  -  VHSIC  hardware  description  language  supports  the 
design,  description,  and  efficient  simulation  of  VHSIC 
components. Its ability to describe hardware at various levels of
abstraction, from the high-level, systems-oriented view down to
the gate level, makes it appropriate for prototyping. VHDL is the
DOD standard for describing VHSIC hardware.

Verilog - This is an alternative hardware description
language. Similar to VHDL, it is becoming popular in the
development of commercial products.

Proteus  -  Proteus  provides  an  extensible  set  of  high-level 
architecture-independent  primitives  for  parallel  and  distributed 
computation;  a  data  model  supporting  algebraic  specification;  and 
an  identification  of  specifications  and  types  which  supports 
object-oriented  notions  of  subtyping  and  inheritance.  It 
provides a formal concept of module refinement whose
implementation supports both the evolutionary model of software
development, through the refinement of control and data
abstractions to improve the efficiency of early prototypes, and
architectural targeting, by refining architecture-independent
prototypes into restricted forms targeting execution
on specific parallel platforms.


Software/hardware  module  interconnection  schemes  -  This  approach 
is  based  on  maintaining  a  library  of  components  and  having  the 
ability  to  connect  them  together  to  form  the  system.  This  is  one 
of  the  principal  areas  of  prototyping  research. 

CAPS - Computer-Aided Prototyping System, Naval Postgraduate
School. The CAPS method uses a prototyping language to design a
hierarchical system structure with real-time and control
annotations, and automatic code generation together with reusable
components to produce executable prototypes.

PERTS  -  A  Prototyping  Environment  for  Real-Time  Systems, 
University  of  Illinois.  PERTS  will  provide  an  environment  for 
the  use  and  evaluation  of  new  design  approaches,  for 
experimentation  with  alternative  system  building  blocks,  and  for 
the  analysis  and  performance  profiling  of  prototype  real-time 
systems.

NSS  -  Network  Synthesis  System,  JRS  Research,  Inc.  This 
tool  provides  the  ability  to  rapidly  prototype  a  network  and 
evaluate  it  against  its  specifications  and  constraints.  The  tool 
will rely on a reusable parts library for both software (Ada) and
hardware (VHDL).


User  interface  developers  -  Environments  which  support  user- 
friendly  interface  development  are  extremely  important  for 
communication between the designer and management. A generally
accepted practice is for designers and management to walk through
the user interface screens before much of the functionality has
been  implemented.  At  these  walk-throughs,  errors  in  requirements 
or  requirement  understanding  are  often  detected.  Many  commercial 
products  exist  which  support  user  interface  development. 


RESEARCH  AREAS 

Although  many  promising  approaches  to  prototyping  are 
currently  available,  there  remain  many  critical  technologies 
which  must  be  developed  in  order  to  comprehensively  support  the 
design  cycle.  Some  of  these  areas  are  listed  below. 

Module  Interconnection  Formalisms  -  These  formalisms  will  allow 
the  development  and  management  of  software  databases  and  the 
mechanism for interconnecting existing software independent of
the  language  used  to  implement  it,  the  platform  it  runs  on,  or 
the  communication  media  available  to  access  it.  With  the  many 
research  efforts  emphasizing  prototyping  through  module 
interconnection,  it  is  important  for  these  formalisms  to  be 
developed and adopted in the near future. Currently, the DARPA
Prototech community is investigating this issue.

Information  Abstraction  -  To  allow  analysis  at  a  high-level  and 
simplify  lower  level  analysis,  information  is  abstracted  away. 
This  presents  a  danger  that  the  information  will  lose  its 
identification  with  reality.  Without  the  ability  to  reference 
reality,  the  validity  of  the  prototype  is  difficult  to  determine 
because  it  is  no  longer  possible  to  determine  the  validity  of  the 
initial information. Also, information abstraction currently
hides timing and resource utilization information, limiting the
usefulness of prototyping for complex, real-time applications.

Prototype  migration  -  Current  techniques  do  not  permit  the 
migration  of  one  design  phase  prototype  to  the  next  phase.  This 
limits the integration of prototyping in the design process
because of the effort required to prototype at each stage and the
inability to support a spiral design cycle, since information cannot
easily be passed between phases.


CONCLUSIONS 

Future  Navy  systems  will  require  a  departure  from  current 
system  development  practices  if  they  are  to  have  higher 
performance;  better  reliability,  fault-tolerance,  security,  and 
maintainability;  faster  development  time;  and  lower  cost.  One 
promising  approach  to  improve  design  methodology  is  to  integrate 
rapid  prototyping  into  the  design  cycle.  For  this  approach  to 
succeed,  designers  and  management  must  understand  the  different 
uses  for  prototypes.  Currently  many  developers  do  not 
distinguish  between  different  types  of  prototypes  and  as  a 
result,  prototypes  are  viewed  as  a  great  expense  with  little 
return.  Additionally,  the  integration  of  prototyping  into  the 
design  cycle  will  require  the  development  of  tools  and  techniques 
that  permit  the  designer  greater  flexibility  and  faster  results 
than  he  has  using  current  methodologies.  Although  prototyping 
has  great  promise,  successful  research  in  interconnection 
formalism,  information  abstraction,  and  prototype  migration  will 
greatly  enhance  the  current  capabilities  of  prototyping. 

ABSTRACT


The  paper  considers  the  problem  areas  that  need  to  be  investigated  to  develop  a  new  real-time 
system  design  paradigm  for  the  emerging  applications  and  computing  environment.  The 
envisioned  environment  will  use  the  so  called  intelligent  networks  comprised  of  processors, 
storage,  communications  and  input/output  resources  and  a  methodology  for  their  management. 
It  will  be  shared  by  multiple  applications  which  will  be  guaranteed  to  be  executed  with  specified 
timing  and  reliability.  These  applications  and  the  computing  environment  will  be  so  extensive 
that  use  of  automatic  tools  that  involve  simulation  and  optimization  of  allocation  of  resources 
will  be  imperative.  While  much  research  has  been  conducted  on  system  specification  languages, 
far less effort has been expended on how the systems specified using these languages can be
evaluated  and  analyzed.  While  basic  techniques  have  been  developed  in  the  respective  areas,  it  is 
necessary  to  develop  a  consistent  and  practical  set  of  models  of  the  entities  in  the  application  and 
environment  and  the  network  management  methodology.  These  models  will  then  have  to  be 
used  practically  in  simulating  and  optimizing  the  respective  designs.  The  collection  of  these 
models will in fact constitute the simulation and optimization tools. The paper focuses on the
following: modelling of concurrency methods, simulation models for application entities,
simulation models for intelligent network entities, simulation of application timing, simulation
models of recovery, and optimization models.

Basic  techniques  have  been  proposed  in  the  above  areas.  It  is  necessary  to  integrate  them  in  an 
overall practical simulation and optimization guided by human design. These models will
also provide a consistent framework for formal definitions of the semantics of the application
and  computing  environment  entities  used  in  specification  languages,  including  the  interactions 
between  the  entities. 

1.  INTRODUCTION 

The  emerging  computing  environment  is  envisaged  as  consisting  of  high-performance 
computing  resources  connected  through  a  high-speed  communications  network  which  execute 
diverse  applications  in  a  timely  and  reliable  manner.  The  effective  exploitation  of  the  full 
potential  of  such  an  environment  will  require  an  advanced  software/hardware  design  paradigm 
based  on  sound  methodologies,  use  of  computer-aided  tools,  and  education  of  the  designers  in 
the  new  paradigm.  [52] 

The  key  aspect  of  the  paradigm  is  providing  throughout  the  system’s  life  cycle  simulation  and 
optimization  support  for  intermediate  designs,  and  interactive  man-machine  redesign  by 
feedback  of  the  results.  The  paper  discusses  the  problem  areas  in  developing  this  paradigm. 

The  paradigm  focus  is  on  real-time  geographically  distributed  applications.  Typical  applications 
are  manufacturing  systems,  monitoring  systems,  weapons  systems  and  intelligent 
communication  systems.  Currently,  most  of  these  applications  use  dedicated  networks. 
However,  in  the  envisaged  environment,  these  networks  will  be  shared  by  many  applications. 
Examples are provided in the intelligent networks envisioned by commercial telephone
companies. [1,45] They will support many virtual networks with guaranteed performance and
reliability, each virtual network supporting its own time-critical application.

The potential high cost associated with an incorrect operation of these systems demands a
rigorous  framework  in  which  design  alternatives  can  be  postulated,  optimized,  and  analyzed 
during  design  as  well  as  during  their  operations  (to  react  to  dynamic  changes  in  the  computing 
environment).  These  systems  are  costly  to  prototype  as  they  change  dramatically.  Thus  the 
timing  properties  of  design  alternatives  must  be  carefully  evaluated  under  varying  conditions. 
The  design  of  these  systems  is  becoming  increasingly  difficult  because  more  functionalities  are 
expected  and  more  distributed  components  are  networked  together.  Because  of  this,  the  role  of 
the  human  designer  is  limited  to  providing  guidance  in  exploring  alternatives  that  can  be 
expressed  on  the  high  level  of  system  architecture.  Because  of  the  complexity  of  the 
applications,  design  alternatives  can  only  be  effectively  evaluated  and  optimized  automatically 
through  extensive  simulation. 

Three  major  criteria  must  be  applied  to  the  methods  that  embody  the  paradigm: 

a)  The  man-machine  interactions  required  to  guide  the  design  and  propose  alternatives. 

b)  The  evaluation  of  system  entities  and  their  interactions  through  simulation. 

c)  The  optimization  of  system  architecture  and  resource  allocation  on  a 
dynamic  basis. 

There  has  been  much  research  on  specifying  the  applications.  As  discussed  below,  we  accept  a 
specification  language  as  a  given.  Our  objective  is  to  outline  the  evaluation  and  analysis  models 
for  applications  specified  in  these  languages  which  are  executed  using  an  intelligent  network. 
The  consistent  collection  of  these  models  will  constitute  the  practical  simulation  and 
optimization  tools  used  in  the  design  paradigm. 

The  rest  of  the  paper  is  organized  as  follows.  Section  2  provides  a  further  overview  of  the 
future operating environment and use of the design paradigm. Section 3 identifies necessary
research  and  development  areas  needed  in  the  models  incorporated  in  design  tools.  It  explains 
what  needs  to  be  done  in  each  of  these  areas.  Section  4  concludes  the  paper. 

2.  FUTURE  IMPLEMENTATION  ENVIRONMENT  AND  USE  OF  DESIGN 
PARADIGM 

Based  on  current  technological  advances  and  customer  expectations,  it  is  not  difficult  to 
imagine  a  future  high-speed  intelligent  network  that  simultaneously  supports  many  real-time 
applications.  An  example  is  the  intelligent  telephone  network  that  will  provide  world-wide 
distributed  computing  services  in  real-time.  [1,45]  There  are  many  factors  that  affect  the 
performance  of  such  systems:  the  hardware  resources  of  the  distributed  network,  the  software 
structure  of  the  distributed  application,  the  mapping  of  the  software  components  to  the  hardware 
resources,  and  the  reconfiguration  capability  for  adapting  to  dynamically  changing  conditions. 

Because  of  complexity,  a  time  critical  system  must  be  developed  following  a  design 
methodology  that  supports  an  iterative  and  top-down  development.  This  design  methodology 
must  be  based  on  a  framework  that  supports  the  progressive  re-architecture  of  system 
components  and  the  comparative  evaluation  of  intermediate  designs.  For  example,  the  designer 
may  describe  the  system  as  consisting  of  a  relatively  few  communicating  components  whose 
internal  details  are  hidden.  The  designer  then  uses  the  tools  to  evaluate  and  optimize  this  design 
assuming  the  existence  of  some  hardware  resources.  Based  on  the  evaluation,  the  designer  may 
change  its  design  or  refine  it  by  detailing  the  internals  of  some  or  all  components.  This  will  be 
continued  until  the  design  is  detailed  enough  to  be  subjected  to  correctness  verification.  During 
refinement, some earlier design decisions may also have to be incrementally revised. Above
all, much of this design methodology must be supported by computer-aided tools that can
automatically find optimal mappings from software components to hardware resources and
simulate  their  performance.  The  simulation  should  also  identify  bottlenecks  and  suggest  how  to 
improve  performance. 

3. PROBLEM AREAS IN DEVELOPING THE DESIGN METHODOLOGY


Much research has been reported on specification of computerized systems, but relatively little
has been done on the three problem areas stated above: (a) the user guiding the automated
design, (b) the consistent modelling and simulation of application/resource entities, and (c)
optimization of the allocation of computing resources.

There  is  however  a  considerable  reservoir  of  basic  applicable  techniques  in  these  areas  that  need 
to  be  investigated  and  adopted,  as  appropriate.  [6,9,24,25,31,32,35,39,42,46,48] 

3.1  Specification  Methods 

The  methods  of  specification  of  computer  systems  reported  to  date  may  be  accepted  as  given. 
[2,5,8,17,20,29,31,32,49,50]  They  can  be  generally  characterized  as  follows: 

Independence of specification of the application from specification of the computing
resources: These two types of specifications are composable separately and independently.
There is, however, a third specification that ties application entities with computing
resources: that of timing aspects of the application entities when using available computing
resource architectures.

Graphic  representation  of  a  specification:  A  specification  can  be  represented  as  an 
Entity-Relation  graph,  where  entities  are  nodes  and  relations  are  edges.  A  variety  of  attributes 
may  be  associated  with  the  nodes  and  edges  as  follows: 

Application specification: typically consists of objects and transformation entities
(nodes) and data flow or object oriented relations (edges). [13,19] Also, it may include
as attributes throughput, execution times, deadlines and response times, geographical
constraints, as well as reliability and recovery.

Computing Resources specification: typically consists of processing, storage, and
communication entities (nodes), and interconnection relations (edges). Also, it includes
attributes of processing capacities and dynamic network management and scheduling
strategies. [3,4,7,46]
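
The entity-relation representation described above can be sketched as a small attributed graph. The class names, entity names, and attribute keys below are hypothetical and are not drawn from any of the cited specification languages; the sketch only shows the application and resource specifications kept separate, with a third structure carrying the timing that ties them together.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    kind: str                       # e.g. "object", "transformation", "processor"
    attributes: dict = field(default_factory=dict)

@dataclass
class Relation:
    source: str
    target: str
    kind: str                       # e.g. "data_flow", "interconnection"
    attributes: dict = field(default_factory=dict)

@dataclass
class Specification:
    entities: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)

    def add_entity(self, name, kind, **attributes):
        self.entities[name] = Entity(name, kind, attributes)

    def relate(self, source, target, kind, **attributes):
        # Edges may only join entities already present in the graph.
        for end in (source, target):
            if end not in self.entities:
                raise KeyError(f"unknown entity: {end}")
        self.relations.append(Relation(source, target, kind, attributes))

    def neighbors(self, name):
        return [r.target for r in self.relations if r.source == name]


# Application specification: entities (nodes) plus data-flow relations (edges).
app = Specification()
app.add_entity("sensor_feed", "object")
app.add_entity("track_filter", "transformation", deadline_ms=20)
app.relate("sensor_feed", "track_filter", "data_flow")

# Computing resources specification, composed independently of the application.
resources = Specification()
resources.add_entity("cpu0", "processor", capacity_mips=50)
resources.add_entity("link0", "communication", bandwidth_mbps=10)
resources.relate("cpu0", "link0", "interconnection")

# Third specification: timing of an application entity on a given resource.
timing = {("track_filter", "cpu0"): {"execution_ms": 5}}
```

Keeping the timing table outside both graphs means either specification can be replaced without touching the other, at the cost of having to keep the keys in the table consistent with both.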

Hierarchically  structured  entities:  The  applications  and  computing  resources  considered  here 
will  typically  be  very  extensive.  For  a  human  to  provide  design  guidance,  it  will  be  necessary  to 
create entities on multiple levels of detail, each representing an abstraction of the lower level
entities.  The  design  may  be  performed  on  a  selected  level  of  detail.  Design  based  on  a  higher 
level  is  easier  to  specify,  evaluate  and  verify.  However,  it  hides  lower  level  details.  Working 
with  a  design  based  on  higher  level  entities  incurs  the  penalty  of  reduced  precision,  as  well  as 
reduced  possibilities  for  optimization.  Based  on  feedback  from  the  design  process,  the  user  may 
have  to  modify  the  hierarchical  structure  by  merging  or  subdividing  entities.  [40,52] 

While  specification  languages  vary,  it  is  assumed  that  they  may  be  transformed  into  a  form  that 
uses  the  models  of  various  entities  and  relations  discussed  below.  The  amalgamation  of  these 
models  will  be  the  simulation  and  optimization  tools  that  will  be  used  in  the  design  paradigm. 

3.2  Design  Areas  That  Need  to  be  Developed 

Six design areas are outlined below. The state of the art of relevant techniques must be
investigated and, if appropriate, incorporated into the simulation and optimization tools used in
the design paradigm. They need to be evaluated based on their impact on: (a) user interactions,
(b) modelling of entities for the simulation, and (c) the optimization of allocation of resources.
Further,  there  is  a  need  to  verify  the  method  in  test  applications  to  ascertain  its  practicality.  The 
design  areas  are  as  follows: 

MODELING OF CONCURRENCY METHODS: Concurrency is the prime method used to
parallelize operations in order to satisfy real time requirements. A number of techniques are
available for synchronizing and scheduling concurrent distributed processes. [27,...,41] To
guide the design, the user must understand the respective methods. Methods based on high level
entities (such as those based on messages) may be preferred, because of ease of human
interaction and verification, over methods based on low level entities (such as those based on
semaphores). The selected concurrency methods must be used to drive the simulator. They also
impact the search for optimal allocation of resources and the need by the user to modify the
architecture of the application based on results of the simulations. Thus, the selection of
concurrency method is critical as it cuts across the entire design paradigm.
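
The difference in level between the two families of methods can be seen in a small message-based example. This is only an illustrative sketch: Python threads stand in for distributed processes, the function names are hypothetical, and the standard-library queue plays the role of a message channel. The locking that a semaphore-based version would expose to the designer is hidden inside the channel.

```python
import queue
import threading

def producer(outbox, items):
    for item in items:
        outbox.put(item)       # send a message; no shared state is touched directly
    outbox.put(None)           # sentinel: no more messages will follow

def consumer(inbox, results):
    while True:
        item = inbox.get()     # blocks until a message arrives
        if item is None:
            break
        results.append(item * item)

channel = queue.Queue()
results = []
threads = [
    threading.Thread(target=producer, args=(channel, range(5))),
    threading.Thread(target=consumer, args=(channel, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

The only synchronization the reader has to reason about is "send" and "receive", which is the ease of human interaction and verification that the paragraph above attributes to message-based methods.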

SIMULATION MODELS OF APPLICATION ENTITIES: These models actually represent the
semantics of the entities expressed in the specification language to specify the application. The
user must understand these semantics in order to control the progress of the design. The models
must be defined for all the software objects that correspond to application entities and for the
interactions between these objects. Some entities may be specified on a higher level than that of
software objects and the selection of how they are to be implemented may be left to a later stage.
As noted, entities may be hierarchically structured. The exploding or imploding of entities must
follow modelling rules as well. [15,36,...]

SIMULATION MODELS OF COMPUTING RESOURCES AND COMMUNICATION
ENTITIES: These models must be defined for all the types of entities that are used: processors,
storage, communications and input/output devices. Separate models are needed for entities with
different architectures. Resource entities may also be hierarchically structured. A methodology
is needed for exploding or imploding resource entities in a multiple level structure. In addition,
alternative management methods of the computing environment must be modelled, including the
scheduling and recovery algorithms. These environment management methods will be used to
drive the simulation as well.

SIMULATION MODELS FOR TIMING OF APPLICATION ENTITIES: These are models for
evaluating the delay involved in operations of an application entity, when using a feasible
resource architecture. A delay may vary depending on the type of operation and on the state of
the entity. There are a number of methods for specifying (and simulating) timing of entities.
They vary in the degree of complexity in specifying the delays and the simulation time required
to determine the delays.

SIMULATION MODELS FOR RECOVERY: These are models of the computing environment
management for detecting and responding to events if resources malfunction and if load
variations imperil the ability of the network to satisfy real time requirements. These models
must incorporate evaluation of recovery time. The simulation model for occurrence of the need
for recovery must be included as well. The model must take into account the effect of built-in
redundancy in network resources on need for recovery.

OPTIMIZATION METHODS: The optimization is typically used to allocate resources which
minimize operating costs, delays, or recovery time, or some combination of these. The
envisaged applications and the computing and communication resources are extensive. The
allocation space (the feasible combination of allocated resources and application entities) is
immense. Manual resource allocation is totally impossible. However, the rationale in the search
of the allocation space must be explainable to the user, so that user may offer guidance in the
search. Resource allocations have varying life times and stabilities. For instance, distribution of
information in data bases may not need to be changed for a long time. Such allocations may be
taken as static and completely stable. However, the design must explore wide classes of possible
allocations of communication and processors as such allocations may have to be changed
frequently. Some allocations may have to be reoptimized frequently during daily operation, as
the application traffic changes and as resources increase or decrease. The optimization method
must be progressive so that if time allows it can continue in trying to improve the allocation of
resources, and otherwise it accepts intermediate results. [11,16,26,27,30,34,40]
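
The "progressive" requirement can be made concrete with an anytime search that always holds a usable allocation and improves it for as long as its budget lasts. The sketch below is an assumption-laden illustration, not a method from the cited literature: the cost function (peak relative processor load), the single-task move, and all names and numbers are hypothetical.

```python
import random

def allocation_cost(assignment, loads, capacities):
    """Cost = heaviest relative processor load (lower is better)."""
    used = [0.0] * len(capacities)
    for task, proc in enumerate(assignment):
        used[proc] += loads[task]
    return max(u / c for u, c in zip(used, capacities))

def progressive_allocate(loads, capacities, budget, seed=0):
    """Anytime local search that always holds a valid best-so-far allocation.

    `budget` is the number of improvement attempts; a smaller budget simply
    returns an intermediate result. Hypothetical sketch only.
    """
    rng = random.Random(seed)
    best = [0] * len(loads)                      # start: every task on processor 0
    best_cost = allocation_cost(best, loads, capacities)
    for _ in range(budget):
        candidate = list(best)
        task = rng.randrange(len(loads))
        candidate[task] = rng.randrange(len(capacities))   # move one task
        cost = allocation_cost(candidate, loads, capacities)
        if cost < best_cost:                     # keep only strict improvements
            best, best_cost = candidate, cost
    return best, best_cost


loads = [4.0, 3.0, 3.0, 2.0, 2.0, 1.0]
capacities = [10.0, 10.0, 5.0]
for budget in (0, 10, 200):
    _, cost = progressive_allocate(loads, capacities, budget)
    print(f"budget {budget:>3}: peak utilization {cost:.2f}")
```

Because each accepted move is a single reassignment with a measurable change in cost, the search history can be shown to the user, which is one way the rationale of the search could be made explainable.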

While basic techniques in the above areas exist, there is no systematic structure for incorporating
and modifying various types of models. A comprehensive framework must be constructed in
which these models can be incorporated. The productivity of the overall simulation and
optimization must be tested using realistic examples in the context of existing workstations
and high speed processors.

4.  CONCLUSION 

The problem areas enumerated in this paper, while they have been stated in terms of their use as
the tools of a system design paradigm, really concern the definition of basic concepts used in
system specification and implementation. They provide a basis for a formal definition of the
semantics of application and resource entities. They will also take into account realistically the
human capacity to guide the design process and the computing resources needed for design.

Benefits of Using Object-Oriented Methodology for
Missile Guidance Processor Software Development

Abstract

The purpose of this paper is to summarize
the benefits of using the object-oriented
methodology to specify and develop a
real-time system. Object orientation is a
methodology by which abstract
implementation-oriented classes are
developed. Classes denote groups of
system objects that share common
characteristics and behavior. These classes
are further specified by mapping data
"attributes" and procedural "methods" to
particular class "instances". Finally, the
class interfaces or "message" structures are
defined. It has been reported that the
object-oriented approach can reduce
development time of software systems [1].
This paper will show the object-oriented
methodology applied in specifying an
existing case study [3]. This case study is of
a real-time software system designated the
VGPS, which implements a missile guidance
processor control system.

INTRODUCTION 

In this paper the main system to be
examined is the Operational Flight Program
(OFP) [3]. The OFP is the single system
user that has three primary processes: 1)
Autopilot, 2) Guidance, and 3) Gain
Computer. The Operational Flight
Program interfaces are identified in Figure
1. The following are commonly used
abbreviations [3]:

FCS: FIRE CONTROL SYSTEM
T/M: TELEMETRY
IRU: INERTIAL REFERENCE UNIT
R/T: RECEIVER / TRANSPONDER
CAS: CONTROL ACTUATION SYSTEM
S/A: SAFE AND ARM
GPU: GUIDANCE PROCESSOR UNIT
P/F: PROXIMITY FUZE




Figure  1  -  VGPS  INTERFACE  BLOCK  DIAGRAM 


In Object-Oriented (OO) methodology a
system  is  a  group  of  objects  that 
communicate  among  themselves.  This 
communication  can  be  understood  by 
analyzing  the  relationships  that  exist 
between  objects,  the  behavior  of  each 
individual  object,  and  the  mutual  behavior 
of  cooperating  objects. 


Objects themselves may also be considered
systems. For the purposes of this case study, the
OFP is a system of smaller objects that
interact to produce the behavior associated
with the object "OFP". An OO
specification attempts to modularize a
system along the same object boundaries
that exist within the real world. This leads
to a one-to-one correspondence between the
object as it exists in the real world and the
object as it exists in a specification. Herein
lies one of the principal benefits to be
realized using an object-oriented
specification. In addition to simplifying the
conceptual model of an object's
specification, it also serves to organize all
knowledge about each object in a single
logical location.

Because  of  the  emphasis  on  objects  and  the 
corresponding  de-emphasis  on  processes, 
the  OO  approach  to  system  specification 
encourages  the  specifying  of  logical 
systems  rather  than  physical  systems.  The 
necessary  functional  properties  of  a  system 
may  easily  be  expressed,  but  these  take 
priority  over  the  specification  of  system 
objects and their relationships and
interactions. OO techniques also provide
forms  of  conceptual  abstraction  not  found 
in  other  techniques  that  rely  heavily  on  the 
physical  approach. 

In addition to aggregation, the OO approach
provides generalization and classification.
These make it easier to specify the information
processing with respect to objects. Most
importantly, the object methodology
implicitly leads to the construction of
systems that embody the five attributes of
well-structured complex systems:

1) Complexity takes the form of a
hierarchy where every complex
system is composed of related
subsystems, which themselves are
composed of subsystems, until the
elementary components are reached
[2].

2)  The  choice  of  what  constitutes  an 
elementary  component  is  arbitrary  and 
is  totally  dependent  on  the  observer  of 
the system.


3) "Intracomponent linkages are
generally stronger than
intercomponent linkages. This fact
has the effect of separating the high-
frequency dynamics of the
components - involving the internal
structure of the components - from
the low-frequency dynamics -
involving interaction among
components" [2]. This leads to a
separation of concerns allowing the
observation of various parts of the
system to be encapsulated.

4) "Hierarchic systems are usually
composed of only a few different
kinds of subsystems in various
combinations and arrangements" [2].

5) "A complex system that works is
invariably found to have evolved from
a simple system that worked..." [5].
That is to say, as systems evolve,
objects that were once considered
complex become the elementary
objects upon which more complex
systems are constructed.

The  following  benefits  of  applying  the 
object  methodology  have  been  discovered: 
(1) Lower development risk.

(2) Allows systems to be developed that
are more adaptable.

(3) Modern software practices are
implicitly involved.

(4) Allows system development to be
more natural in its approach.

LOWER  DEVELOPMENT  RISK 
The risk of developing systems with the
object methodology is lowered for three
reasons. First, integration is spread out
across the entire system life cycle. Second,
during the object methodology design phase
an intelligent separation of concerns
reduces development risk and also brings
the benefits of incremental development.
Lastly, due to separation of concerns, a
complex system can be readily checked
for correctness. In this case study, a
real-time system had to be developed in
a timely manner and be fully
operable with little maintenance.

The distribution of system integration and
separation of concerns has the benefit of
minimizing the impact of inevitable system
changes. The object classes developed in
earlier phases of the system life cycle allow
changes lower in the hierarchy to be made
with no effect on the higher level objects.
This level of abstraction is a natural benefit
of developing objects by inheriting methods
from subclasses.

As in the original case study [3], the careful
design of object classes allowed the full
advantage of incremental development to be
realized. This provides early functionality
and early evaluation of system capability,
which in the development of a complex
hierarchical system reduces the time needed
for integration. A side benefit of early
system evaluation is a capital savings of
several orders of magnitude, as it is well
known that the earlier a system
modification is made in the system
development life cycle, the less costly that
modification is to administer.

ADAPTABILITY


In approaching the topic of adaptability it
must be remembered that several layers of
abstraction were used in the original case
study design to facilitate this feature.
It can therefore be seen that, by providing a
"sufficient" level of abstraction, a system
can be designed to be adaptable in the sense
that the overall functionality will not
change drastically (if it did, a new system
should be designed). The essential system
will therefore be stable, and the "how" of
implementing this essential system
will change to a measurable degree. A
"properly" designed OO system will exhibit
adaptability in that its superclasses will be
affected only minimally by changes at
the base class level. Where this is not
immediately apparent, another
level of abstraction needs to be added to
insulate the superclasses from the base
classes. OO methodology supports
adaptability through the mechanisms of
abstraction, encapsulation, modularity, and
hierarchy.

Abstraction is one of the principles that we
use to cope with complexity. Hoare
suggests that "abstraction arises from a
recognition of similarities between certain
objects, situations, or processes in the real
world, and the decision to concentrate upon
these similarities and to ignore for the time
being the differences" [6]. Through the
development of objects (hardware/software)
from base classes that were
analyzed for correctness and functionality,
the VGPS system implementation utilized
information hiding/method hiding. This
allows the higher level objects to be
"insulated" from the changes made to lower
level objects. The advantage to system
development is that objects that are not yet
implemented can be readily accessed by using
stub functionality.

Encapsulation is the principle by which no
part of a complex system should be
dependent on the internal implementation
of another part of the system. Whereas
abstraction helps one conceptually model a
system to the desired level of focus,
encapsulation "allows program changes to
be reliably made with limited effort" [7].
In the initial class designs for the VGPS it
was decided that the elementary hardware
dependent components should be
encapsulated to allow the higher level
classes to function relatively independently of
the underlying hardware system. As in
the original case study, this also had the
benefit of allowing a consistent interface to
several system support objects. This
allowed the design and construction of high
level objects that did not need modification
as hardware changes were instituted and the
fully operational Operational Test Program
was implemented in final form.
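
The encapsulation of hardware dependent components, together with the stub idea mentioned earlier, can be sketched as follows. This is only an illustration under stated assumptions: the class names, the actuator example, and the counts-per-degree scaling are hypothetical and do not come from the VGPS design.

```python
from abc import ABC, abstractmethod


class ActuatorPort(ABC):
    """Consistent interface that higher-level classes program against."""

    @abstractmethod
    def write_deflection(self, degrees: float) -> None:
        ...


class StubActuatorPort(ActuatorPort):
    """Stand-in used before the hardware-dependent class exists."""

    def __init__(self):
        self.log = []

    def write_deflection(self, degrees: float) -> None:
        self.log.append(degrees)  # record instead of driving hardware


class ScaledActuatorPort(ActuatorPort):
    """Hypothetical hardware revision that expects integer counts."""

    COUNTS_PER_DEGREE = 100

    def __init__(self):
        self.register = 0  # simulated device register

    def write_deflection(self, degrees: float) -> None:
        self.register = round(degrees * self.COUNTS_PER_DEGREE)


class Autopilot:
    """Higher-level class; unchanged when the port implementation changes."""

    def __init__(self, port: ActuatorPort):
        self.port = port

    def steer(self, error_deg: float, gain: float = 0.5) -> None:
        self.port.write_deflection(gain * error_deg)
```

`Autopilot` can be developed and tested against `StubActuatorPort` and later handed a `ScaledActuatorPort` without modification, which is the kind of insulation the paragraph above describes.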

Modularity is the principle by which a
system is partitioned into individual well-
defined components. This allows easier
comprehension and reduces the complexity
of a system. Modules serve as physical
containers in which we implement the
classes and objects of our logical design.
It can therefore be seen that the principles
of abstraction, encapsulation, and
modularity are cooperative in their
functionality as far as the object
methodology is concerned. That is, an
object provides a well-defined boundary for
a single abstraction, and encapsulation and
modularity provide the necessary barriers
around the object. The application of
modularity in the VGPS system was the
natural result of designing meaningful base
classes. The principles of encapsulation
and separation of concerns that were
utilized in the base class design
naturally lead to the development of
modular software. Furthermore, these
principles lead to increased cohesion and
decreased coupling.

Hierarchy is the "...ranking or ordering of
abstractions" [4]. This was accomplished
in the VGPS by the successive use of base
classes to form objects which in turn were
the base "classes" for more complex
objects. The hierarchy allowed the
development of a system that was hardware
independent and very adaptable. The
hardware independence was realized
through the careful design of the interface
between hardware dependent base classes
and their respective superclasses.
Superclasses represent a more general
abstraction, and therefore as objects were
built with these superclasses those objects
became dependent on the functionality of
the superclasses themselves and not the
underlying implementation. This in turn
allowed the complete development
of those objects that interfaced with a
particular superclass even though the
procedural methods of that superclass were
potentially volatile.

MODERN  SOFTWARE  TECHNIQUES 
The modern software engineering
techniques that were used in the
construction of the VGPS, as in the original
case study, include information hiding,
separation of concerns, modularity,
adaptability, and incremental development.
Again, as stated in the original case study,
the application of these techniques has not
been studied extensively in real-time
systems development; we therefore make no
theoretical claim about the advantages
and disadvantages of their use in system
development.


However, because the object methodology
supports most of these techniques implicitly,
it can be argued that a system constructed
using this methodology will, as a rule,
take less time to develop and maintain.
This is primarily
true because the object methodology
stresses the essential specification of a
system during the base class design phase.
The formation of superclasses from base
classes further facilitates the development
of systems that are stable and correct,
because the designer can comprehend more
of the system's functionality and therefore
can at once see the correctness or
impropriety of a particular relationship or
interface. Furthermore, the resulting class
reuse from a stable working system
(a byproduct of the object
methodology) leads to the formation of
even more complex systems that are correct
and stable with little effort. Maintaining
and enhancing systems developed with the
object methodology tends to be easier due to
the incorporation of separation of concerns,
encapsulation, and modularity. That is, a
particular enhancement can be made to a
[...] advantages garnered from designing/
implementing via the object methodology
versus designing/implementing with a
methodology and adding in these "extras".
As stated before, the use of these modern
software techniques has not been fully
documented as far as the respective benefits
and advantages to be realized from their
use. But it is not a great leap to
conclude that designing/implementing
with these tools will allow the system
designer to implement more reliable
systems faster.

Abstract: Requirements methods proven practical on
large embedded computer systems (ECS) are
formalized, synthesized, and improved. A cross-section
of methods are evaluated for robust semantics,
mathematical foundation, capability for analysis and
verification, and support for model construction,
comprehension, reuse, and modification. A synthesized
Pragmatic Formal Method (PFM) augments the best
characteristics of current methods. Some of the
capabilities defined for PFM could be added to current
methods, supporting analysis for errors of omission and
commission, and facilitating system simulation and
early prototyping.

ECS ENGINEERING

An embedded computer system (ECS) is part of a
physical system and must perform quickly and correctly
for the physical system to perform its intended function.
These systems interact with their environment (people,
hardware devices, and software in other systems), and
are composed of subsystems that may themselves be
composed of smaller subsystems interacting with each
other and their environment; see Fig. 1. Each
subsystem usually contains both hardware devices and
software. The hardware devices may be generic
processors or displays, or may be functionally
specialized for an application (e.g., radars,
communication, and navigation devices). Each
hardware device may also contain software. The
software and hardware interact with each other and with
the environment.

Systems Engineering: For embedded systems, system
requirements and design must be performed before
software requirements can be specified. Systems
engineering is engineering from a total system, rather
than from a component, viewpoint. The system
engineer analyzes end-to-end critical processing
threads, including analog parts that affect performance
and accuracy, to develop a cost-effective, feasible
design that meets mission requirements. He/she
allocates system requirements to hardware, software,
firmware, communication links, and people; and then
allocates software requirements to distributed
subsystems and distributed databases. A systems
engineer must have a strong software engineering
background to make decisions that avoid software
development problems.

Systems that process and control information are
usually complex; many of these systems are
unprecedented and requirements change frequently.
Discipline, clarity, and automation are needed to design
such systems. System engineers should be concerned
with three fundamental concepts (TRUX 72):

• Modeling - a quantitative description of system
operation

• Dynamics - the change in the system with
respect to time

• Optimization - choice of a good design based on
relative weights assigned to system aspects
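
The optimization concept in the last bullet, choosing a design by relative weights, can be made concrete with a simple weighted sum. This is a minimal sketch only; the aspect names, scores, and weights are hypothetical and are not drawn from the paper or from (TRUX 72).

```python
def weighted_score(design, weights):
    """Weighted sum of a design's per-aspect scores (higher is better)."""
    return sum(weights[aspect] * design[aspect] for aspect in weights)


def best_design(designs, weights):
    """Name of the candidate with the highest weighted score."""
    return max(designs, key=lambda name: weighted_score(designs[name], weights))


# Hypothetical scores on a 0-10 scale and hypothetical relative weights.
designs = {
    "A": {"performance": 8, "reliability": 6, "affordability": 4},
    "B": {"performance": 6, "reliability": 9, "affordability": 7},
}
weights = {"performance": 3, "reliability": 2, "affordability": 1}
```

With these numbers design A scores 40 and design B scores 43, so `best_design(designs, weights)` selects B; changing the weights changes which design counts as "good".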

Requirements Models: ECS are primarily control
oriented and require different techniques for system
development than transaction-based systems.
Embedded systems must respond quickly and correctly
to complex sequences of unpredictable external events.
The system response varies according to the order of
these events. Consequently, the accurate description
and analysis of the desired system reaction to sequences
of external stimuli is a very important step in the
specification of embedded systems. The model that
embodies hierarchy and dataflow, which is sufficient
for transaction-based systems, is insufficient for real-
time systems. Requirements must include the dynamic
view that specifies changes in system state caused by
external events.

The boundaries of the system must also be
explicitly defined, and the environment of the system
must be modeled and analyzed as part of the system
requirements development process. The model should
include the time-dependent, deterministic and
nondeterministic, parallel and serial nature of the inputs.

Nonfunctional requirements such as performance,
fault detection and recovery, safety, security,
availability, reliability, and ease of change are major
concerns in complex embedded computer systems.
They must be addressed during the generation of system
and software requirements.

Mathematical Formalisms: ECS are complex and a
single mathematical formalism is insufficient to define
all system aspects. It is necessary to find a set of
formalisms that provides a minimal cover of those
aspects necessary for system description. Quantitative
models based on mathematical formalisms are needed
to support tradeoff analysis and the development of
quality designs. Mathematical formalisms are required
because of the size of the problem. Mathematics
provides the discipline, rigor, and reasoning ability that
are needed to solve complex problems.

METHOD  EVALUATION 

Grumman has sponsored research in requirements
definition methods from the late 70's through the
present (BARI 79, WHIT 81, WHIT 85a, WHIT 85b,
WHIT 86, WHIT 87a, WHIT 87b, WHIT 92a, WHIT
92b). Grumman also performed research in
requirements definition under contract to Naval Air
Development Center (WHIT 80), and Naval Research
Laboratory (WHIT 83, NASH 84, KMIE 84).

Methods Evaluated: In (WHIT 87b), we examined
eight methods for modeling reactive systems to
determine the formalisms used, and to evaluate method
capability for supporting embedded system
specification. These methods have not changed
appreciably since 1987. New object-oriented analysis
methods (COAD 91, SHLA 88) use an entity-
relationship approach to information modeling. These
methods provide an important view, but are not
complete. They do not address the scenario or control
aspects of ECS requirements.

The methods we examined are the semi-
formal/formal methods most commonly used in
industry for modeling complex systems. They are:

• Distributed Computing Design System (DCDS), a
function flow and dataflow method

• Structured Analysis for Real-Time Systems
(SA-RT), a dataflow and state-machine method

• Higher Order Software (HOS), a function
composition method

• Jackson System Development (JSD), a data
(event) structure method

• Software Cost Reduction (SCR), a method based
on cooperating sequential processes and event-
action causality

• PAISLey, a method based on cooperating
sequential processes and functional programming

• STATEMATE, a method based on cooperating
sequential processes and state decomposition

• PAMELA, an object-oriented requirements
analysis method.

The mathematical bases of these methods are: input
to output mapping, algebra (data abstraction), process
abstraction, function composition/decomposition,
sequential machines, concurrent machines (cooperating
sequential processes), extended abstract machines¹, and
predicate logic.


Techniques for Method Evaluation: The following
techniques were used to compare methods:

• Each method was used to specify the
requirements for an embedded system. A home heating
system was used for benchmark comparison. Based on
this exercise and the literature, we evaluated each
method with respect to thirteen desirable method
characteristics.

• We compared each method's model with a
generic model. A formal Entity-Relationship (ER)
model (CHEN 76) was used to compare the expressive
power of methods. Expressive power is important, as
statements that cannot be expressed in a model will be
omitted from the analysis and resulting specification.

• Methods were analyzed to determine whether
they use a partial order temporal approach, specifying
maximal concurrency and nondeterminism. All
methods specify some concurrency, but most only
specify deterministic flow. Within processes, most
methods use a linear or branching order concept.
Linear methods are the least powerful and partial order
approaches are the most powerful (PNUE 86).
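
The difference between a linear and a partial order specification can be illustrated by enumerating the sequences a partial order permits. The sketch below is an assumption-laden illustration, not part of the evaluated methods: the event names are hypothetical heating-system events, and the brute-force enumeration is only practical for a handful of events.

```python
from itertools import permutations


def linearizations(events, before):
    """List every total order of `events` consistent with the partial order.

    `before` is a set of (a, b) pairs meaning a must precede b.
    """
    result = []
    for order in permutations(events):
        position = {e: i for i, e in enumerate(order)}
        if all(position[a] < position[b] for a, b in before):
            result.append(order)
    return result


# Hypothetical events: the two sensor reads are unordered with respect
# to each other, but both must precede the furnace decision.
events = ["read_room_temp", "read_setpoint", "decide_furnace"]
before = {("read_room_temp", "decide_furnace"),
          ("read_setpoint", "decide_furnace")}

for order in linearizations(events, before):
    print(" -> ".join(order))
```

A linear specification commits to exactly one of the printed sequences, whereas the partial order states only the two required precedences and leaves the sensor reads free to occur concurrently or in either order.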

Formal Characterization of Method Models:
Experimentation with method models has shown that
they can all be considered as special cases of the ER
model (TEIC 80).

Formalizing system definition methods using ER
models has several advantages:

• The method objects and relationships are precisely
defined.

• The requirements knowledge captured by the
methods can be stored in a database for analysis
and retrieval.

• The expressive power of method languages can be
compared.

Entity-Relationship Representation of Generic Real-
Time Model: The Entity-Relationship model in Fig. 2
is  used  as  a  baseline  to  analyze  a  method’s  underlying 
model.  The  generic  model  is  a  simplified  one,  but 
includes  environment  as  well  as  system  functions, 
which  illustrates  whether  the  method  addresses 
environment  modeling.  The  generic  model  also 
decomposes  state,  function,  event,  and  data.  This  is 
used  to  demonstrate  that  the  evaluated  methods 
decompose  either  event  or  state,  but  not  both  (e.g.,  JSD 
decomposes  events  while  STATEMATE  decomposes 
state).  Decomposition  of  both  state  and  event  would 
provide  more  power  for  tracing  system  requirements  to 
software  requirements. 


¹An extended abstract machine is an abstract state
machine that permits condition "guards" to be
associated  with  events.  This  makes  the  specification 
more  succinct  as  the  machine  has  knowledge  about 
states  of  other  machines,  and  every  event  sequence  does 
not  have  to  be  specified. 
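
The guard idea in this footnote can be sketched as a state machine whose transitions are enabled only when a condition on another machine's state holds. This is a minimal illustration under stated assumptions; the class, the valve and furnace machines, and the event names are hypothetical.

```python
class GuardedMachine:
    """State machine whose transitions carry optional guard conditions."""

    def __init__(self, initial):
        self.state = initial
        self.transitions = []  # (source, event, guard, target)

    def add(self, source, event, target, guard=lambda: True):
        self.transitions.append((source, event, guard, target))

    def fire(self, event):
        """Take the first enabled transition for `event`; return True if one fired."""
        for source, ev, guard, target in self.transitions:
            if source == self.state and ev == event and guard():
                self.state = target
                return True
        return False


# Hypothetical example: the furnace may start only while the fuel valve
# machine is in its "open" state.
valve = GuardedMachine("closed")
valve.add("closed", "open_cmd", "open")

furnace = GuardedMachine("off")
furnace.add("off", "heat_request", "running",
            guard=lambda: valve.state == "open")
```

Because the guard consults the valve's state directly, the furnace machine does not need separate states for every combination of valve and furnace history, which is the succinctness the footnote refers to.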

In the illustration, rectangles represent objects;
circular nodes represent relationships; the nodes are
numbered for identification. The relationship holds in
both directions; the relationship name has been
provided for only one direction to keep the diagram
simple.  For  example,  relationship  (1)  is  read 
ENVIRONMENT  contains  FUNCTION,  but  the 
complementary  relationship  FUNCTION  contained  in 
ENVIRONMENT  also  holds. 

In the generic model, both the ENVIRONMENT
and SYSTEM contain FUNCTIONs. A FUNCTION is
performed in certain STATEs, causes and is triggered
by EVENTs, is a refinement of a more abstract
FUNCTION, and uses and sets DATA. STATEs can
have subparts, and are contained in a STATE-
MACHINE-ABSTRACTION, which can be a
refinement (detailed definition) of a more abstract
STATE. EVENTs contain and affect STATEs.
EVENTs are decomposed, and the levels of event
refinement relate to the levels of data refinement.
EVENTs contain EVENTs, and an EVENT at an
abstract level is started and ended by EVENTs at a
more detailed level. EVENTs are a function of DATA,
which is also decomposed.
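
One way to picture how such a generic model serves as a baseline is to hold its relationships as (subject, relationship, object) triples and compare a method's model against them. The sketch below is an assumption: the triples paraphrase the prose above rather than reproduce the numbered relationships of Fig. 2, and `event_method` is a hypothetical method model.

```python
# (subject, relationship, object) triples paraphrased from the prose above.
GENERIC_MODEL = {
    ("ENVIRONMENT", "contains", "FUNCTION"),
    ("SYSTEM", "contains", "FUNCTION"),
    ("FUNCTION", "performed_in", "STATE"),
    ("FUNCTION", "causes", "EVENT"),
    ("FUNCTION", "triggered_by", "EVENT"),
    ("FUNCTION", "refines", "FUNCTION"),
    ("FUNCTION", "uses", "DATA"),
    ("FUNCTION", "sets", "DATA"),
    ("STATE", "has_subpart", "STATE"),
    ("STATE", "contained_in", "STATE-MACHINE-ABSTRACTION"),
    ("EVENT", "affects", "STATE"),
    ("EVENT", "contains", "EVENT"),
    ("EVENT", "function_of", "DATA"),
}


def missing_from(method_model, generic=GENERIC_MODEL):
    """Return the generic relationships a method's model cannot express."""
    return generic - method_model


def decomposes(model, entity):
    """True if the model relates `entity` to itself, i.e. decomposes it."""
    return any(s == entity and o == entity for s, _, o in model)


# Hypothetical method that decomposes events but not states.
event_method = GENERIC_MODEL - {("STATE", "has_subpart", "STATE")}
```

Here `missing_from(event_method)` returns the single state-decomposition triple, mirroring the way the paper reports which generic relationships a given method cannot express.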

METHODS  EVALUATED 

Four  methods,  DCDS,  SA-RT,  SCR,  and 
STATEMATE are described in detail. A set of tables
compares  the  eight  methods  identified  above.  The  eight 
methods  are  described  in  more  detail  in  (WHIT  87b). 

DISTRIBUTED  COMPUTING  DESIGN  SYSTEM 
(DCDS):  DCDS  was  developed  as  a  real-time  system 
methodology  at  TRW,  based  on  research  sponsored  by 
the  U.S.  Army  Ballistic  Missile  Defense  Advanced 
Technology Center, BMDATC (ALFO 85). Research
began in 1973. The method includes a System
Requirements  Engineering  Methodology  (SYSREM), 
and  a  Software  Requirements  Engineering 
Methodology  (SREM).  SYSREM  is  used  for  defining 
system  requirements  and  allocating  them  to  the  data 
processing  subsystem  and  the  hardware.  SREM  is  then 
used  for  defining  the  software  requirements. 

System  Requirements  Model:  An  important  concept  of 
SYSREM  is  that  a  system  function  acts  over  a  finite 
interval  of  time.  This  time  interval  is  the  maximum 
computation  time  and  is  specified  as  a  performance 
requirement.  A  system  action  is  represented  as  a 
function  that  accepts  items  arriving  during  the  time 
interval  and  transforms  them  into  items  that  appear  as 
output  during  the  time  interval. 

The system requirements model is based on a
stimulus-response graph model, called an F-net, with
nodes representing subfunctions and edges representing
events, or flow due to function completion. The system
accepts a sequence of inputs and produces a sequence
of outputs. In applying the method, normal
functionality is modeled first, then exception conditions
are added. At the primitive level, the graph model is
equivalent to a set of state machines which receive
inputs and produce outputs. Some of these state
machines are realized by hardware and some by
software. The software is allocated to processors and
interface designs are derived.

Software Requirements Model: After system
requirements are allocated to processors, SREM is used
to define data processing requirements. For each
processor, a set of graph models called R-nets define
paths from input to output interfaces. Each R-net has a
single entry, which may be an interface to the
environment, and one or more exits, which are either
interfaces to the environment or terminators. Processor
inputs and outputs arrive and depart at ports, called
interfaces, and are identified as messages from a
hardware device or another processor. R-nets are
composed of subnets, which are similar to R-nets and
which can be further decomposed. At the primitive
level, nondecomposable functions, called alphas, use
inputs and data derived by other alphas to determine the
value of new data items. An R-net can trigger R-nets
via an event. A delay associated with the event causes
the event to take place at a later time. Validation points
are nodes which can be added to an R-net for specifying
a path of processing through the net and its subnets.
Data is recorded at each validation point during
dynamic specification traversal, simulating system
operation. Minimum and maximum response times can
be expressed for paths, and accuracy requirements can
be expressed for data.
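
The role of alphas and validation points along a single path can be sketched as follows. This is an illustrative assumption, not SREM notation: the step encoding, the simulated clock, the alpha functions, and the timing bounds are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Data recorded at validation points during one traversal."""
    records: list = field(default_factory=list)  # (point, time, data snapshot)

    def elapsed(self, start, end):
        """Time between the first recordings at two validation points."""
        times = {}
        for name, t, _ in self.records:
            times.setdefault(name, t)
        return times[end] - times[start]


def run_path(steps, message, trace):
    """Traverse a linear path of steps, advancing a simulated clock.

    Each step is ("alpha", function, duration) or ("validate", name).
    """
    clock, data = 0.0, dict(message)
    for step in steps:
        if step[0] == "alpha":
            _, fn, duration = step
            data.update(fn(data))  # alphas derive new data items
            clock += duration
        else:
            trace.records.append((step[1], clock, dict(data)))
    return data


# Hypothetical path: two alphas between validation points V1 and V2.
path = [
    ("validate", "V1"),
    ("alpha", lambda d: {"range_m": d["raw"] * 2.0}, 0.003),
    ("alpha", lambda d: {"in_range": d["range_m"] < 100.0}, 0.002),
    ("validate", "V2"),
]
trace = Trace()
out = run_path(path, {"raw": 10.0}, trace)
```

A response-time requirement on the path then becomes a check such as `0.001 <= trace.elapsed("V1", "V2") <= 0.010`, and an accuracy requirement becomes a check on the data snapshot recorded at `V2`.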

Evaluation  of  DCDS  with  Respect  to  Generic  Model: 
Both  the  software  system  and  its  environment  (the 
hardware)  contain  functions.  DCDS  describes 
functions, using both a function and net concept. The
net  details  the  abstract  function  and  consists  of 
subfunctions.  Functions  cause  and  are  triggered  by 
events,  and  use  and  set  data.  At  the  primitive  level  in 
SYSREM,  the  nets  are  equivalent  to  state  machines  and 
function  is  synonymous  with  state.  States  and  events 
cannot  be  decomposed.  Relationships  in  the  generic 
model which are not expressible in DCDS are
illustrated in Fig. 3. Rectangles denote objects and
circles denote relationships between these objects. The
relationship is identified by the number in the circle.
Relationships 3, 5, 6, 7, 10, 11, 12 are not expressible in
DCDS. This is illustrated by the lack of shading in the
figure.

DCDS Temporal Approach: DCDS can specify
concurrency and selection, but not nondeterminism.
Within SREM, maximal concurrency is not specifiable
and is limited to fan-out, fan-in along a path. In some
cases the limitations of linear temporal order are
imposed. The modeler is forced to show one path.
Alternate design paths cannot be shown.

YOURDON STRUCTURED ANALYSIS REAL-TIME
(SA-RT): The Yourdon SA-RT methods are
primarily  based  on  dataflow  (WARD  85).  SA-RT 
modeling  begins  with  the  development  of  an  external 
event  list,  to  help  engineers  focus  on  those  events  in  the 
environment  to  which  the  system  must  respond.  The 
events,  which  are  labeled  as  data  or  control,  are  used  to 
define  the  system  boundary  and  first  level 
decomposition. 

SA-RT  Model:  The  model  contains  a  hierarchy  of  data 
flow diagrams (DFDs). The top level, called the
context  diagram,  describes  the  interface  between  the 
system  and  its  environment.  It  consists  of  one  data 
transformation  which  defines  the  system,  and  dataflows 
which  define  the  system  inputs  and  outputs.  Lower 
level  DFDs  contain  data  transformations  which  perform 
the  system  function  and  which  pass  data  (dataflows) 
among  themselves.  Data  can  also  pass  between  DFDs 
at  the  same  level.  Data  which  is  not  immediately  used 
is  retained  in  "datastores".  Primitive  (not  decomposed) 
data  transformations  are  described  in  minispecs,  which 
can  be  written  in  structured  English  or  a  specific 
program  design  language  (PDL),  or  can  be  described  by 
decision  tables.  Data  structure  is  defined  in  a  data 
dictionary. 

A control transformation is detailed by a diagram
called a state transition diagram (STD), similar in style
to a Mealy sequential machine model. The STD
activates and deactivates processes based on the history
of signals from the external environment and other
processes, and sends signals to other control processes
and the environment. When events are primarily data
oriented, the DFDs are drawn first; when events are
primarily control oriented, so is the system, and the
STDs are drawn first.
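
The behavior of such an STD can be sketched as a small Mealy-style table in which each (state, signal) pair gives a next state and a list of output actions. The state names, signal names, and the process name sample_sensor below are hypothetical and are not taken from the SA-RT literature; ignoring unlisted signals is also an assumption.

```python
# Hypothetical state transition diagram (STD), Mealy style: each
# (state, input signal) pair yields a next state plus output actions.
STD = {
    ("idle", "start"): ("running", [("activate", "sample_sensor")]),
    ("running", "stop"): ("idle", [("deactivate", "sample_sensor")]),
    ("running", "fault"): ("halted", [("deactivate", "sample_sensor"),
                                      ("signal", "alarm")]),
    ("halted", "reset"): ("idle", []),
}


def run_std(signals, state="idle"):
    """Feed a sequence of signals to the STD.

    Returns the final state, the set of active processes, and the list of
    signals sent to other control processes or the environment.
    """
    active, sent = set(), []
    for sig in signals:
        if (state, sig) not in STD:
            continue  # assumption: signals not listed for a state are ignored
        state, actions = STD[(state, sig)]
        for kind, target in actions:
            if kind == "activate":
                active.add(target)
            elif kind == "deactivate":
                active.discard(target)
            else:  # "signal"
                sent.append(target)
    return state, active, sent
```

Calling run_std(["start", "fault"]) walks idle, running, halted, leaves no process active, and records one outgoing "alarm" signal.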

SA-RT Temporal Approach: SA-RT specifies
concurrency by modeling concurrent STDs and by
activating parallel processes. Maximal concurrency
could be specified but that is unlikely, as maximal
concurrency is not a primary concern of SA-RT. The
SA-RT dataflow concept forces the SA-RT modeler to
show one path; design alternatives cannot be shown.

Evaluation of Yourdon SA-RT With Respect To
Generic Model: SA-RT models EVENTs in the
ENVIRONMENT but not FUNCTIONs and thus
cannot express relation (1) in the generic model. Events
are not decomposed so SA-RT does not express
relations (7,10,11,12), as shown by the lack of shading
in Fig. 4, SA-RT Compared to Generic Model.


SOFTWARE COST REDUCTION (SCR): The
Software Cost Reduction (SCR) project was established
by Naval Research Laboratory (NRL) to prove that
modern software engineering practices could be used
for large Navy software projects. The A-7E Aircraft
software was rebuilt as a test.

Support for Information Hiding: The SCR Software
Requirements Specification is based on the principles of
separation of concerns and information-hiding (HENI
78, HENI 80, WHIT 83). If hardware dependent
information is needed, the analyst looks at the Input and
Output Data Items section of the SRS. This section is
organized by hardware device, and by input and output
name. The information in this section is not repeated
elsewhere in the SCR document. Only the acronyms
for input and output names are used. If hardware
related attributes change, only the Input and Output
Data Items section should have to be changed.

The software functional requirements are described
by Modes of Operation, and by Time-independent
Description of Software Functions. If the definition of
environmental events and required software actions
change, only these sections should have to be changed.
Timing requirements are specified by function and are
listed separately. If the timing requirements change,
only this section should have to be changed.

The  SCR  document  has  a  placeholder  for  a  section 
on  accuracy  requirements.  A  format  for  these 
requirements is not defined. The Undesired Event
Responses section contains a list of events which could
occur and what response is desirable.

The  SCR  Required  Subsets  section  identifies 
subsets  of  services  which  could  be  useful,  if  isolated,  in 
the  development  of  similar  systems.  The  Expected 
Types  of  Changes  section  identifies  areas  of  possible 
future  change  so  that  the  design  can  accommodate 
these. 

The  SCR  Glossary  provides  a  definition  of 
abbreviations,  acronyms,  and  technical  terms.  Sources 
for  information  are  given.  Alphabetical  Indices  provide 
access  to  inputs,  outputs,  modes,  and  functions.  The 
SCR dictionary contains text macros. Text macros are
phrases used to simplify the document.

SCR  Model:  Templates  are  completed  for  inputs  and 
outputs.  A  function  is  defined  for  each  output. 
Functions  are  differentiated  as  either  periodic 
(synchronous)  or  demand  (asynchronous).  A  periodic 
function  may  be  initiated  and  terminated  or  may  always 
operate.  Tables  are  used  to  define  events  and 
conditions  that  cause  a  change  in  output  value  or  that 
activate/deactivate periodic functions.

Because  functions  differ  greatly  in  different  modes, 
modes  are  used  to  simplify  the  function  description.  A 
mode  in  the  SCR  methodology  is  a  system  state  defined 
by the history of events in the system. The modes are
grouped by mode class. The system can be in more
than one mode class at a time. The modes within each
mode class are mutually exclusive.
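
The mode-class idea can be illustrated with a minimal sketch in which each mode class holds exactly one current mode and the system state is one mode per class. The class names, mode names, and events below are invented for illustration and are not taken from the A-7E specification.

```python
class ModeClass:
    """A set of mutually exclusive modes; exactly one is current."""

    def __init__(self, name, initial, transitions):
        self.name = name
        self.current = initial
        self.transitions = transitions  # {(mode, event): new_mode}

    def on_event(self, event):
        # Stay in the current mode unless the table lists a transition.
        self.current = self.transitions.get((self.current, event),
                                            self.current)


class System:
    """The system is in one mode of every mode class at the same time."""

    def __init__(self, mode_classes):
        self.mode_classes = {mc.name: mc for mc in mode_classes}

    def on_event(self, event):
        for mc in self.mode_classes.values():
            mc.on_event(event)

    def modes(self):
        return {name: mc.current for name, mc in self.mode_classes.items()}


# Hypothetical mode classes, invented for illustration only.
navigation = ModeClass("navigation", "aligning",
                       {("aligning", "align_done"): "inertial",
                        ("inertial", "platform_fail"): "degraded"})
display = ModeClass("display", "normal",
                    {("normal", "platform_fail"): "warning"})
system = System([navigation, display])
```

Because the current mode depends only on the events seen so far, it acts as a compact summary of event history, which is the role the text assigns to modes.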

Text  macros  take  the  place  of  compound  conditions 
and  data  item  definitions,  when  the  data  items  are  not 
outputs  or  directly  derived  from  inputs.  The  use  of  text 
macros  supports  information  hiding.  If  there  is  a 
change  in  the  set  of  compound  conditions  or  in  the  data 
item  definition,  only  the  dictionary  has  to  be  changed, 
not the entire document.

Evaluation  of  SCR  Methods  With  Respect  to  Generic 
Model:  The  SCR  methods  do  not  model  functions  in 
the ENVIRONMENT, so relation (1) in the generic
model is not expressed (see Fig. 5). STATEs and
EVENTS  are  not  decomposed,  so  state  decomposition 
relations  (5,6)  in  the  generic  model  are  not  expressed. 
Events  are  not  decomposed,  so  relations  (7,10,11,12) 
are  not  expressed. 

SCR  Temporal  Approach:  The  SCR  methods  are  based 
on a partial order concept. Since the input to output
transformation  is  not  identified,  both  nondeterminism 
and  maximal  concurrency  can  be  specified.  The 
extended  machine  concept  simplifies  the  definition  of 
required  temporal  order. 

STATEMATE:  The  STATEMATE  model  is  a  graphic 
model  based  on  cooperating  sequential  processes, 
extended  abstract  state  machines,  predicate  logic  and 
dataflow  (ADCA  85,  HARE  86). 

STATEMATE  Model:  The  model  contains  templates, 
Statecharts,  and  Activity  Diagrams.  Templates  are 
completed  for  the  following  objects:  state,  condition, 
event,  action,  activity  (function),  signals/variables, 
modules, and channels (which connect modules). A
Statechart is a visual extension to conventional State
Transition Diagrams used in SA-RT. An Activity
Diagram  shows  data  and  control  flow  and  is  similar  to  a 
dataflow  diagram  in  SA-RT. 

The  Statechart  is  a  major  contribution  of 
STATEMATE.  A  number  of  state  transition  diagrams 
can  be  shown  on  the  same  chart,  displaying  parallelism, 
selection,  and  decomposition.  STATEMATE  can 
model  cooperating  sequential  processes  or  support 
structured  analysis  methods. 

Evaluation  of  STATEMATE  With  Respect  to  Generic 
Model:  STATEMATE  does  not  decompose  EVENTs. 
STATEMATE  can  express  all  relations  in  the  generic 
model  except  event  decomposition  relations 
(7,10,11,12), as shown in Fig. 6.

STATEMATE Temporal Approach: If Statecharts were
used alone, STATEMATE would be based on a partial
order concept. With the incorporation of user specified
Activity Diagrams, a transformation from input to
output is defined and STATEMATE is reduced to a
modified branching logic. Within the branching logic,
maximal concurrency and nondeterminism cannot be
specified.

COMPARISON  OF  METHODS  WITH  RESPECT 
TO  METHOD  ATTRIBUTES 

The  evaluation  of  eight  methods  is  summarized  in 
table  1.  Methods  are  judged  with  respect  to  thirteen 
desirable  method  characteristics.  To  clarify  usage  of 
each  method's  vocabulary  in  the  remainder  of  the  table, 
the  first  row  in  sheet  1  of  the  table  compares  method 
objects  to  a  set  of  generic  objects.  The  last  two  rows  in 
sheet 3 identify format and automated support for each
method. 

Table  2  compares  the  eight  methods  by  assigning  a 
value  (0,  2.5,  or  5)  which  measures  the  capability  of 
each  method  with  respect  to  each  attribute.  A  value  of 
zero  indicates  the  method  is  poor.  A  value  of  2.5 
indicates  the  method  is  fair.  A  value  of  five  indicates 
the  method  addresses  the  attribute  well.  The  SCR 
methods  score  the  best  with  a  total  score  of  47.5,  which 
shows these methods to be far superior. DCDS follows
with  a  score  of  30.  STATEMATE  received  the  third 
highest  score,  25. 

Formal  Basis  (Attribute  1):  Current  methods  are 
based  on  the  following  formal  concepts:  (1)  input  to 
output  mapping,  (2)  function  composition/ 
decomposition,  (3)  process  abstraction,  (4)  data 
abstraction, (5) cooperating abstract machines, (5+)
cooperating  extended  abstract  machines,  and  (6) 
predicate  logic. 

• DCDS is based on (1), (2), (3), (5)
• HOS is based on (1), (2), (3), (4)
• JSD is based on (1), (2), (3)
• SA-RT is based on (1), (2), (3), (5)
• SCR is based on (1), (2), (3), (5+), (6)
• STATEMATE is based on (1), (2), (3), (5+), (6)
• PAISLey is based on (1), (2), (3), (^
• PAMELA is based on (1), (2), (3), (4)

SCR  and  STATEMATE  are  the  most  formal  of  the 
evaluated  methods. 
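
The list above can be captured as a small lookup table. The sketch below assumes that "most formal" can be read as "rests on the largest number of formal concepts"; the paper does not state its criterion, so this reading is only an illustration. The fourth PAISLey entry is not legible in the source and is left out.

```python
# Formal concepts per method, keyed by the numbering used in the text.
FORMAL_BASIS = {
    "DCDS": {"1", "2", "3", "5"},
    "HOS": {"1", "2", "3", "4"},
    "JSD": {"1", "2", "3"},
    "SA-RT": {"1", "2", "3", "5"},
    "SCR": {"1", "2", "3", "5+", "6"},
    "STATEMATE": {"1", "2", "3", "5+", "6"},
    "PAISLey": {"1", "2", "3"},  # fourth entry illegible in source; omitted
    "PAMELA": {"1", "2", "3", "4"},
}


def methods_using(concept):
    """Methods whose formal basis includes the given concept number."""
    return sorted(m for m, basis in FORMAL_BASIS.items() if concept in basis)


def broadest_basis():
    """Methods resting on the largest number of formal concepts."""
    most = max(len(basis) for basis in FORMAL_BASIS.values())
    return sorted(m for m, basis in FORMAL_BASIS.items() if len(basis) == most)
```

Under that assumption, broadest_basis() selects SCR and STATEMATE, the only two methods that list predicate logic (6) and extended abstract machines (5+).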

Model  Construction  (Attribute  2):  The  SCR  model  is 
the  easiest  to  construct  and  change,  as  long  as  mode 
definitions  are  stable,  as  each  data  item,  function 
producing  output  data,  and  text  macro  (definition  of 
compound  conditions  or  function  producing  internal 
data)  is  defined  in  one  and  only  one  place,  and  there  are 
no  explicit  interconnections  between  functions.  A 
problem  would  arise  if  there  were  major  changes  to 
mode  definitions,  as  function  definition  is  dependent  on 
modes.  If  utilizing  SCR  methods  for  model 
construction,  modes  should  be  introduced  after 
requirements have stabilized.

Model Comprehension (Attribute 3): The SA-RT
method  provides  a  good  visual  overview  of  a  logical 
design. Requirements are clearest in the SCR model as
they are explicitly defined in event-action causal
statements: "If event a occurs and condition b is true,
perform action c". There is a problem in understanding
how the model works together as this is not explicitly
defined. There is no way to tell if all the requirements
are present or if the system will work, if built to these
requirements. The static interrelationship of SCR
model parts could be shown using diagrams similar to
dataflow diagrams. These diagrams should be
generated from the SCR model. If methods defined by
Grumman  were  automated,  the  SCR  model  would  be 
executable.  Then,  the  dynamic  interrelationship  of 
parts  could  be  shown  by  running  the  model. 

Design  Independence  (Attribute  4):  If  modes  were 
not  used,  the  SCR  model  would  be  design  independent; 
the modes impose design constraints. All other method
models  are  design  dependent.  They  provide  a  design 
which  is  independent  of  physical  constraints. 

Stepwise  Refinement  (Attribute  5):  No  model 
adequately  refines  state,  function,  event,  and  data,  and 
relates  them  so  that  there  are  well  defined  functional 
layers.  STATEMATE  has  developed  the  best  definition 
of  state  decomposition,  but  engineers  using  the  method 
find  it  difficult  to  relate  decomposition  of  state  to 
decomposition  of  function.  These  engineers  use 
structured  analysis  methods  and  turn  processes  on  and 
off  by  a  state  transition  "controller"  at  each  level.  Our 
research  extends  the  SCR  methods  by  expanding 
requirements  defined  in  "text  macros."  When  text 
macros  define  conditions  not  directly  derivable  from 
inputs,  a  function  is  needed  to  evaluate  the  condition. 
These  functions  become  additional  cooperating 
processes.  In  addition,  we  decompose  state,  function, 
event,  and  data  so  that  system  requirements  can  be 
traced  to  software  requirements,  and  levels  of 
abstraction  can  be  defined. 

Separation  of  Concerns  (Attribute  6):  The  SCR 
methods  are  the  only  methods  that  have  made 
separation  of  concerns  a  goal  and  provide  the  best 
support. All other methods show explicit
interconnections between model parts, and would be
difficult  to  partition  according  to  the  concerns  of 
different  specification  readers. 

Nonfunctional  Requirements  (Attribute  7):  It  has 
been  shown  (WHIT  83)  that  functional  timing 
constraints  could  differ  by  mode;  and  it  is  therefore 
reasonable  to  believe  they  could  differ  within  a  function 
according  to  event,  condition,  or  output  value.  The 
SCR methods address timing constraints well, but they
can only be associated with data or function. PAISLey
addresses timing constraints, but also associates them
with function or process. DCDS associates timing
constraints with objects (data, events, functions) or
stimulus response path, and thus is the most powerful in
this regard. Other nonfunctional requirements are not
as well addressed as timing. PAISLey specifies
function reliability. DCDS associates nonfunctional
requirements with all objects, but the requirements are
specified textually.

Design  Tradeoff  Analysis  (Attribute  8):  Design 
tradeoffs  for  distributed  processing  depend  on  external 
event  frequency,  periodic  and  aperiodic  inter-system 
message flow, nonfunctional requirements, estimated
data and function size, scheduling constraints,
communication protocols, and on processor, storage,
and  communication  device  constraints.  Depending  on 
allocation,  fault  detection  and  analysis  procedures  vary 
and  must  be  considered.  Tools  are  available  (FRAN 
87)  for  performing  analysis,  but  the  evaluated  methods 
do  not  provide  much  of  the  data  needed  for  using  this 
type  of  tool. 

Test  Identification  (Attribute  9):  A  product 
acceptance  specification  defines  externally  visible 
behavior  which  the  product  must  demonstrate,  and 
specifies any design constraints that must be met. The
SCR  methods  provide  the  most  support  for  identifying 
acceptance  tests,  as  they  describe  behavior  which  is 
externally  visible  or  which  could  be  externally 
determined.  STATEMATE  Statecharts  support  the 
identification of tests as the flow of events and actions
is identified. DCDS functional paths support test
identification, but it is difficult to determine which
paths should be tested other than the ones shown. Other
methods do not support testing well.

Identification  of  Missing  Requirements  To  Be 
Supplied  TBS(s)  (Attribute  10):  The  SCR  templates 
and  the  use  of  text  macros  for  information  hiding 
provide  the  best  support  for  identification  of  TBS(s)  or 
gaps.  It  is  harder  to  support  this  feature  with  a  method 
which  is  based  on  flow. 

Verification  (Attribute  11):  An  entity-relationship 
(ER)  model  and  operational  capability  are  desirable  for 
supporting  model  analysis  and  verification.  PAISLey, 
STATEMATE,  and  DCDS  provide  the  best  facilities  for 
model  verification.  All  three  methods  provide  an 
operational  capability.  A  PAISLey  definition  is 
statically  analyzed  in  a  manner  similar  to  a  software 
program.  STATEMATE  and  DCDS  perform  static 
verification  based  on  the  underlying  ER  model;  the 
database  is  queried  for  information.  We  have  defined 
an  ER  model  that  would  support  SCR  model  capture 
and dynamic execution. Our decision tables, predicate
logic, and temporal logic support static verification.

Reusability (Attribute 12): The SCR methods identify
subsets  of  the  software  which  could  be  replaced  or  be 
reusable.  PAMELA  packages  data  and  function  by 
object  in  Ada  PDL,  and  these  packages  may  be 
reusable.  HOS  abstract  data  types  and  JSD  object 
structures may be reusable. STATEMATE Statecharts
identify control. They may be reusable, if they are
organized according to objects in the environment. Our
method  packages  requirements  by  object,  to  provide  a 
flexible  reusable  specification. 

System  and  Software  Definition  (Attribute  13):  Only 
DCDS  supports  both  system  and  software  definition 
and provides traceability from one to the other.
Unfortunately,  the  path  concept  which  is  used  in  DCDS 
requires  that  design  decisions  be  made  when  developing 
the  model. 

Comparison  Of  Methods  With  Respect  To  Generic 
Real-Time  Model:  None  of  the  evaluated  methods 
contains all objects and relationships in the generic
real-time model. JSD is the only evaluated method that
decomposes EVENT, but does not decompose STATE
or STATE-MACHINE. SA-RT and PAISLey have
STATE decomposition. The SCR method uses
STATEs and STATE-MACHINEs, but does not
decompose  them. 

Comparison  Of  Methods  With  Respect  To  Temporal 
Approach:  The  most  powerful  partial  order  methods 
are  those  that  (1)  can  express  maximal  concurrency,  (2) 
specify  mutual  exclusion  at  every  level,  and  (3)  specify 
nondeterminism without defining all possible paths.
The SCR methods support all three. Item (3) is
supported through the use of extended machines. This
is preferable to defining all possible paths which makes
the  definition  complex.  The  paths  can  be  generated 
from  the  extended  machine,  if  desired. 

A  PRAGMATIC  FORMAL  METHOD  (PFM) 

Our  evaluation  of  practical  methods  has  identified 
the  Software  Cost  Reduction  method  (SCR)  as  having 
the  best  characteristics  on  which  to  build  a  more 
powerful  method.  SCR  methods  have  a  strong  formal 
basis  and  can  be  augmented  to  eliminate  deficiencies. 
We  are  developing  a  method,  PFM,  that  enhances  the 
SCR  methods  with  capabilities  found  to  be  beneficial  in 
other  methods,  and  features  not  available  in  any 
evaluated  method.  PFM  is  discussed  in  detail  in  (WHIT 
87b).  PFM  logic  analysis  techniques  have  been  helpful 
in  checking  aircraft  mode  transition.  Other  capabilities 
have  not  as  yet  been  tested. 

Enhancements  to  the  SCR  methods  include: 

(1) A formal Entity-Relationship model for the SCR
method. This model is a formal description of the
information captured about a target system when
using the SCR method. (Examples of entities are
modes, events, and actions; examples of relationships
are caused-when, makes-true, and activated-by.)

(2)  A  mapping  from  the  static  SCR  model  to  a  net  for 
behavioral  execution.  PFM-nets  are  similar  to  Petri 
nets,  and  are  not  dependent  on  textual  context  for 
execution. They should execute at the speed of Petri
nets.

(3) Support for system and software definition. Early
in system definition, system engineers understand
conditions that affect the system, but have not as yet
decided on system modes. Early system models
should be developed without modes, and modes
should be introduced when the definition is refined.
To support this process, we have created a method for
introducing modes into a model without modes. Our
methods trace data, events, conditions, and functions
from system to software requirements, and between
different levels of definition.

(4) Support for reasoning about a target system
specification.  We  have  found  Allen’s  theory  of 
temporal  intervals  to  be  consistent  with  SCR 
concepts,  and  believe  it  would  be  useful  for  reasoning 
about  temporal  order  during  early  specification 
phases.  We  have  extended  predicate  logic  to  include 
events,  so  that  consistency  of  mode-transition  tables 
can  be  analyzed.  Engineers  using  the  logic  were  able 
to  detect  inconsistencies  in  the  A-7E  Aircraft 
Software  Requirements  Specification. 

(5)  Extensions  to  the  SCR  language.  State  and  event 
relationships, including decomposition, are defined in
the ER model and are used in target system
specification. Indexing and predicate qualifiers have
been added to the language to simplify specifications
involving  multiple  instances  of  an  object  (e.g., 
aircraft  being  tracked).  Other  extensions  permit  the 
specification  of  different  timing  requirements  in 
different  modes,  definition  of  data  senescence,  and 
worst  case  timing  for  response  to  stimuli,  for 
environment  response,  and  for  computation, 
communication,  and  storage  access  delays. 
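
The kind of mode-transition table consistency analysis mentioned in item (4) can be illustrated with a simplified propositional check. This is not the extended predicate logic developed for PFM; it is a sketch that flags two rows which share a source mode and triggering event, name different target modes, and have conditions that can hold at the same time. The row format, mode names, and condition names are hypothetical.

```python
from itertools import combinations


def compatible(when_a, when_b):
    """True if no condition is required True by one row and False by the other.

    Conditions missing from a row are treated as "don't care".
    """
    return all(when_b.get(name, value) == value
               for name, value in when_a.items())


def find_conflicts(rows):
    """Return index pairs of rows that make the table nondeterministic."""
    conflicts = []
    for (i, a), (j, b) in combinations(enumerate(rows), 2):
        same_trigger = a["mode"] == b["mode"] and a["event"] == b["event"]
        if same_trigger and a["to"] != b["to"] and compatible(a["when"],
                                                              b["when"]):
            conflicts.append((i, j))
    return conflicts


# Hypothetical mode-transition table.
TABLE = [
    {"mode": "standby", "event": "power_on",
     "when": {"self_test_ok": True}, "to": "operate"},
    {"mode": "standby", "event": "power_on",
     "when": {"self_test_ok": False}, "to": "fault"},
    {"mode": "standby", "event": "power_on",
     "when": {"override": True}, "to": "maintenance"},
]
```

Here rows 0 and 1 are mutually exclusive on self_test_ok, but row 2 overlaps with both of them, so find_conflicts(TABLE) reports the pairs (0, 2) and (1, 2).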

PFM  Formal  Model:  A  PFM  model  is  a  definition  of 
a  target  system  that  is  to  be  built;  it  is  a  statement  of  the 
functional  requirements.  The  PFM  model  augments  the 
SCR  model  which  consists  of  data  templates,  function 
tables,  and  mode  transition  tables.  These  templates  and 
tables  are  easy  to  create  and  understand,  but  have 
underlying  formalisms.  Function  tables  are  extended 
state  machines,  where  machines  operate  concurrently 
with  and  have  state  knowledge  about  other  machines. 

PFM  uses  the  same  templates  and  tables  for  model 
definition. In addition, PFM uses a formal ER model
which supports automation. The ER concept is used to
formally  define  the  objects  in  the  PFM  model  and 
relationships  between  these  objects.  The  ER  definition 
can  be  used  as  a  logical  database  design  for  storing 
target  system  models  and  is  needed  for  method 
automation.  The  ER  model  can  also  be  used  by  the 
analyst  to  determine  the  proper  use  and  power  of  the 
PFM  method. 

In  PFM  ER  diagrams,  rectangles  represent  objects; 
circles represent two-part relationships; triangles
represent  three-part  relationships;  and  squares  represent 
four-part  relationships.  The  relationships  are  only 
specified  in  one  direction  to  reduce  diagram 
complexity,  but  the  relationships  are  complementary. 
Figure  7  defines  a  PFM  FUNCTION.  When 
relationships  are  three-part  or  four-part,  a  special  syntax 
is  used.  For  example,  the  three-part  relationship  (210) 
reads  V  ACTIVATED  BY  VI  WHEN  I  (FUNCTION 
ACTIVATED  BY  EVENT  WHEN  CONDITION). 
The roman numerals I, V, VI refer to objects which take
part in the relationship. The roman numerals appear in
the upper right-hand corner of rectangles. We have
defined a full set of ER diagrams for PFM.
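
A three-part relationship such as (210) maps naturally onto a three-field record, which is one way the ER definition could serve as a logical database design. The sketch below is an illustration only: the stored facts are invented, and the wording "ACTIVATES" for the complementary reading is an assumption, since the paper does not name the reverse direction.

```python
from collections import namedtuple

# FUNCTION ACTIVATED BY EVENT WHEN CONDITION (relationship 210).
ActivatedBy = namedtuple("ActivatedBy", ["function", "event", "condition"])

# Hypothetical facts about a target system.
FACTS = [
    ActivatedBy("open_oil_valve", "motor_reaches_speed", "house_is_cold"),
    ActivatedBy("close_oil_valve", "house_warms_up", "valve_is_open"),
]


def read_forward(fact):
    """Render the relationship in the direction drawn on the diagram."""
    return f"{fact.function} ACTIVATED BY {fact.event} WHEN {fact.condition}"


def read_complement(fact):
    """Render the complementary direction (wording is an assumption)."""
    return f"{fact.event} ACTIVATES {fact.function} WHEN {fact.condition}"


def functions_activated_by(event, facts=FACTS):
    """Query the stored relationship from the EVENT side."""
    return [f.function for f in facts if f.event == event]
```

Storing the relationship once and deriving both readings mirrors the diagram convention of drawing each relationship in one direction only.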

PFM-Nets:  PFM-nets  are  formally  defined  and  a 
process  transforms  a  PFM  static  model  of  a  Target 
System  to  a  PFM  net.  If  techniques  were  automated,  a 
PFM-net  could  be  computer  generated  from  the  Target 
System  model. 

PFM-nets  are  equivalent  to  Petri  nets  with  inhibitor 
arcs,  and  have  the  same  power  to  perform  system 
simulation  and  analysis.  Like  Petri  nets,  a  PFM-net 
contains  places,  transitions,  and  edges.  However,  in 
PFM-nets,  each  place  can  represent  a  condition  or  an 
event. Each net contains a set of objects. Each object
contains a subset of the places, transitions, and edges in
the net. PFM-nets are organized by object as they are
generated  from  the  static  model. 

The  rules  for  moving  tokens  in  PFM-nets  are 
somewhat  different  than  the  rules  for  Petri  nets.  An 
EVENT  place  keeps  its  token  for  one  time  unit  as  an 
EVENT  is  an  instantaneous  occurrence.  CONDITION 
places  hold  their  tokens  until  the  associated  transition  is 
fired.  CONDITION  places  represent  the  states  of  an 
object. If a CONDITION (state) of object A is linked to
a state-transition for object A, the state of object A
changes  when  the  transition  is  fired,  and  the  token  is 
moved  from  the  old  state  to  the  new  state  as  in  Petri 
nets.  On  the  other  hand,  if  the  state  of  object  A  affects 
a  state  transition  in  object  B,  the  state  of  object  A 
should  not  change  and  the  token  is  not  removed  from 
the  CONDITION  place  in  object  A.  Figure  8  shows 
interacting  house  temperature,  motor,  and  oil  valve  for  a 
home  heating  system. 
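
The token rules described above can be sketched as a simple step function. This is a simplified reading, not the formal PFM-net definition: inhibitor arcs are not modeled, all enabled transitions fire together in one time unit, and no two transitions are assumed to compete for the same CONDITION token. The place and transition names for the heating example are hypothetical and are not taken from Figure 8.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    name: str
    src: str            # CONDITION place of the owning object (token removed)
    dst: str            # CONDITION place of the owning object (token added)
    reads: tuple = ()   # CONDITION places of other objects (token kept)
    events: tuple = ()  # EVENT places that must hold a token
    emits: tuple = ()   # EVENT places marked for the next time unit


def step(conditions, events, transitions):
    """Advance one time unit; returns the new (conditions, events) sets."""
    enabled = [t for t in transitions
               if t.src in conditions
               and all(r in conditions for r in t.reads)
               and all(e in events for e in t.events)]
    new_conditions, new_events = set(conditions), set()
    for t in enabled:
        new_conditions.discard(t.src)   # own state: token moves
        new_conditions.add(t.dst)
        new_events.update(t.emits)
    # Old EVENT tokens are dropped: an EVENT lasts one time unit.
    return new_conditions, new_events


# Hypothetical net for house temperature, motor, and oil valve objects.
NET = [
    Transition("cool_down", "temp_ok", "temp_low", events=("temp_drops",)),
    Transition("start_motor", "motor_off", "motor_on",
               reads=("temp_low",), emits=("motor_started",)),
    Transition("open_valve", "valve_closed", "valve_open",
               reads=("motor_on",), events=("motor_started",)),
]

conditions = {"temp_ok", "motor_off", "valve_closed"}
events = {"temp_drops"}
history = []
for _ in range(3):
    conditions, events = step(conditions, events, NET)
    history.append((sorted(conditions), sorted(events)))
```

In this run the temp_low token stays in place while the motor and valve objects react to it, which is the difference from ordinary Petri net firing that the text describes.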

A rapid prototype can be developed from the PFM
static model by coding the operational aspects of
actions associated with significant objects and events.
The rapid prototype consists of the PFM-net generated
from the uncoded portion of the static model,
interacting with the coded portion. The PFM rapid
prototyping process is effective, according to Bruno and
Marchetto's criteria: "Rapid prototyping is an
alternative paradigm to the conventional software life
cycle... However, the prototyping paradigm is
ineffective if it is not supported by a development
environment that provides an easy derivation of
prototypes from formal specifications and makes the
implementation process partially automated" (BRUN
86).

SUMMARY 

Existing  methods  are  not  adequate  for  modeling 
real-time  embedded  computer  system  requirements. 
Eight  well  known  methods  were  evaluated  to  determine 
method strengths and weaknesses, and to determine the
optimum basis for synthesizing a more powerful
method, a Pragmatic Formal Method (PFM). The best
capabilities of existing methods were selected, and
where required new capabilities are being created, to
ensure that PFM satisfies the needed characteristics for
requirements  modeling. 

FUTURE  RESEARCH 

Our  goal  is  to  develop  a  method  and  completion 
criteria  for  modeling  subsystem  boundary  properties 
during  the  requirements  definition  phase,  so  that 
interoperability  can  be  verified. 

Our  hypothesis  is  that  currently  available  modeling 
techniques  can  be  used  to  specify  subsystem 
interoperability,  but  better  support  is  needed: 

• Detailed steps have not been defined to support
the  process. 

•  Static  and  dynamic  analysis  techniques  are 
required  to  support  model  integration. 

•  Methods  are  needed  for  specifying  and  testing 
feasibility of performance requirements.

•  Support  is  needed  for  testing  the  interoperating 
model. 

The  philosophy  of  modeling  for  interoperability 
will  be  on  modeling  from  the  boundary  of  systems, 
inwards  as  in  the  SCR  methods  and  PFM.  These 
methods  define  system  outputs  in  terms  of  inputs, 
conditions, events, and states, providing
interdependencies of data, not normally found in an
Interface  Requirements  Specification. 

System  context  will  be  defined  in  detail 
incorporating  sequences  of  data  entering  and  leaving 
the  subsystem,  as  in  the  Requirements  Driven  Design 
(RDD)  method  which  is  based  on  DCDS  (ALFO  91). 

Required  communication  mechanisms  for 
interoperating  subsystems  will  be  incorporated  into  the 
model as in PAISLey.

Engineers  must  specify  and  allocate  requirements 
for end-to-end timing of critical processing flows, and
specify delays inherent in existing components.
Methods developed in our earlier research and by
Jahanian and Mok (JAHA 86) will be incorporated into
PFM  methods  and  used  to  document  these  aspects. 
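
A minimal sketch of an end-to-end timing check follows, assuming a critical flow is a sequence of components with known worst-case delays that simply add up. The component names, delay values, and deadline are hypothetical, and the additive model is an assumption; it is not the method of (JAHA 86).

```python
def worst_case_latency(flow, delays_ms):
    """Sum the worst-case delay of every stage along a processing flow."""
    return sum(delays_ms[stage] for stage in flow)


def check_flow(flow, delays_ms, deadline_ms):
    """Compare a flow's worst-case latency with its end-to-end requirement."""
    total = worst_case_latency(flow, delays_ms)
    return {"total_ms": total,
            "slack_ms": deadline_ms - total,
            "ok": total <= deadline_ms}


# Hypothetical delays inherent in existing components (milliseconds).
DELAYS_MS = {"sensor_io": 5, "track_update": 20,
             "bus_transfer": 8, "display_refresh": 15}
FLOW = ["sensor_io", "track_update", "bus_transfer", "display_refresh"]
```

With these figures check_flow(FLOW, DELAYS_MS, 50) reports a total of 48 ms and 2 ms of slack; tightening the requirement to 40 ms makes the same flow infeasible.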

Logic  and  temporal  analysis  techniques,  developed 
by  Grumman  for  analyzing  consistency  when  using 
SCR  methods,  will  be  used  to  ensure  consistency  of 
integrated  subsystem  models. 

Dynamic  methods  are  needed  to  test  and  integrate 
subsystem  behavior.  Researchers  have  proven  that  state 
machines  can  be  mapped  to  Petri  nets,  but  succinct 
methods  like  SCR  and  STATEMATE  use  extended 
state  machines.  PFM-nets  will  be  further  defined  and 
automated to dynamically test behavior and timing of a
set  of  extended  state  machines. 

We  will  incorporate  techniques  for  checking 
overriding constraints. Gist (BALZ 82) uses
cooperating machines called "demons" to check for
conditions that are always prohibited. In a similar
manner, Harel (HARE 92) establishes a behavioral
specification  called  a  "watchdog"  and  uses  reachability 
tests  to  check  whether  unwanted  states  can  be  entered. 
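
A reachability test of the kind described above can be sketched as a breadth-first search over the states of a model, reporting any reachable state that a prohibition predicate flags. This is a minimal sketch and not the Gist or STATEMATE implementation; the pump/valve model, the interlock rule, and all names are assumptions chosen for the example.

```python
from collections import deque


def reachable_states(initial, step):
    """Breadth-first exploration of every state reachable from `initial`."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        for nxt in step(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def violations(initial, step, prohibited):
    """Reachable states for which the prohibition predicate holds."""
    return {s for s in reachable_states(initial, step) if prohibited(s)}


def make_step(interlock):
    """Hypothetical model: state is (pump, valve); interlock guards valve closing."""
    def step(state):
        pump, valve = state
        successors = []
        if valve == "closed":
            successors.append((pump, "open"))
        if valve == "open" and (pump == "off" or not interlock):
            successors.append((pump, "closed"))
        if pump == "off" and valve == "open":
            successors.append(("on", valve))
        if pump == "on":
            successors.append(("off", valve))
        return successors
    return step


def pump_against_closed_valve(state):
    """The always-prohibited condition the watchdog looks for."""
    return state == ("on", "closed")


start = ("off", "closed")
with_interlock = violations(start, make_step(True), pump_against_closed_valve)
without_interlock = violations(start, make_step(False), pump_against_closed_valve)
```

With the interlock the prohibited state is unreachable; without it the search finds the pump running against a closed valve.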

Information  will  be  modeled  in  a  maintainable 
manner,  grouping  information  by  objects  that  occur  in 
the  environment  and  system. 

Detailed  process  steps  and  completion  criteria  will 
be  defined. 

As  a  pilot  project,  boundary  properties  of 
subsystems  currently  under  development  in  our  avionics 
laboratory  will  be  modeled.  The  boundary  properties  of 
several  subsystems  will  be  integrated  into  one 
consistent  model,  that  will  be  executed  to  verify 
interoperability.

[Figure 1. Distributed embedded system model]

[Figure. Generic real-time model]

[Figure. SCR/A7E compared to generic real-time model]

[Figure. STATEMATE compared to generic real-time model]

[Table. Methods (JSD, SA-RT, SCR/A7E, STATEMATE, PAISLey,
PAMELA) classified as (1) input-to-output mapping,
(2) function composition/decomposition, (3) process
abstraction, (4) data abstraction, (5) communicating
abstract machines, (6) communicating extended abstract
machines]

[Table 2. Methods rated, support for desirable
characteristics. Columns: attribute supported, DCDS, JSD,
SA-RT, SCR/A7E, STATEMATE, PAISLey, PAMELA]
Abstract: Problems in the development of large
computer-based systems indicate that a new discipline
is needed at the systems engineering level. Designing
systems with distributed processing and databases
requires analysis of critical end-to-end processing flows
to determine feasibility and proper allocation. An
expanded skills base would be required to enable either
software or systems engineers to perform the necessary
tradeoff studies concerning software, hardware, and
communication components.

This  paper  informs  managers,  engineers,  educators, 
and  researchers  about  the  need  for  computer-based 
systems  engineering  and  the  strategic  opportunities  this 
discipline provides for systems engineering
improvement.

INTRODUCTION 

A  computer-based  system  (CBS)  includes  all 
system  parts  that  process  and  control  information. 
Computer-based  systems  engineering  (CBSE)  requires 
that  activities  of  systems  engineering  be  applied  to  the 
CBS.  CBSE  responsibilities  include  definition  of 
critical  processing  flows,  allocating  software  to 
distributed  systems,  making  decisions  concerning 
computer  hardware,  and  sizing  communication 
components. 

The  IEEE  Computer  Society  Task  Force  on  CBSE 
was  created  in  1991  to  promote  the  discipline, 
encourage  research  in  the  field,  and  establish  a 
framework  for  education  and  training  (Agrawala,  1991; 
Lavi,  1991a;  Lavi,  1991b).  The  task  force  created  three 
working  groups:  education,  research,  and  practice. 
This  task  force  paper  presents  the  case  for  a  CBSE 
discipline,  discusses  current  practices,  and  identifies 
market  and  social  imperatives  for  improving  the  state  of 
the  practice. 

David Owens, SPC

The paper addresses the following issues:

• Examples of CBSs and the number of engineers
currently practicing CBSE in the United States,

• Definition of a CBS and CBSE responsibilities,

• Standard and advanced practices, and areas for
further research,

• Strategic targets for computer-based systems
engineering improvement. These targets can be
implemented today, and would have a high
benefit-to-cost ratio in terms of process and product
improvement. Implementing the suggested improvements
should increase corporate profitability and customer
satisfaction.

SCOPE  OF  CBSE 

System Types. Systems have become more encompassing,
complex, event-driven, physically distributed,
and networked. Table 1 shows examples of systems.

Problems  with  complex  and  stringent  system 
requirements  are  reflected  in  the  following  examples: 

•  Space  Station  Freedom  has  approximately  1.5 
million  requirements. 

• The Air Traffic Control System requires downtime
of no more than six seconds per year for critical
functions.

•  The  next  generation  of  fighter  aircraft  requires 
extensive  computer  control  to  aid  pilot  control.  Wing 
and  tail  control  surfaces  must  be  regulated  thirty  times 
per  second. 

• The modern automobile has more computing
power than the Lunar Lander when it landed on the
moon. It is projected that, by the year 2000, the
automobile computer will have more interactive
sensor-control loops than the largest chemical refinery
in Baytown, Texas.

• Computerized systems can cause unsafe conditions,
leading to increased requirements to "prove" the
safety of system designs.
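
The downtime and control-rate figures in the examples above translate into an availability and a cycle period by simple arithmetic. The short Python sketch below works through both; it assumes a 365-day year, and the function names are introduced only for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 s, ignoring leap years


def availability(downtime_s_per_year):
    """Fraction of the year a function is up, given allowed downtime in seconds."""
    return 1.0 - downtime_s_per_year / SECONDS_PER_YEAR


def period_ms(rate_hz):
    """Time available for one control cycle at the given update rate."""
    return 1000.0 / rate_hz


# Six seconds of downtime per year: about 0.99999981 availability.
atc_availability = availability(6.0)

# Thirty updates per second: about 33.3 ms per control cycle.
control_period = period_ms(30.0)
```

Stated this way, the air traffic control requirement allows roughly two parts in ten million of unavailability, and the flight-control loop must complete sensing, computation, and actuation within about 33 ms.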

Table 1. Examples of Systems

Domains: Defense, Commercial, Public, Telecomm,
Automotive, Financial

VERY HIGH COMPLEXITY: Intelligence Fusion; PIMS;
Weather Forecasting; Intellig. Highways; Space Station;
Airline Reservations; Air Traffic Control; Central Office
Switch; Econometric Model

MEDIUM COMPLEXITY: Cruiser or B-1 Bomber; B-777; BART;
Network Control; Dealer Networks; Portfolio Mgmt;
Logistics Depot; Mfg Automation; Nuclear Reactor;
Campus Backbone; Vehicle Mgmt; NYSE

LOW COMPLEXITY: Smart Bomb; Cell Control; LAN Control;
Cruise Control; Auto-Teller

System Designer Role. A system designer translates
requirements into designs, verifies that system behavior
meets requirements, allocates functions and behavior to
components, and builds system descriptions. During
the development and test phases of a project, a system
designer interprets requirements, guides other system
designers of related systems, and directs tradeoff
studies. Although this effort amounts to only five to ten
percent of the total system project, the adequacy,
accuracy, and timeliness of this work are critical to the
success of all complementary activities.

Number  of  Computer-Based  Systems  Designers. 
The  population  of  system  designers  must  be  identified 
by  what  they  do,  rather  than  by  industry,  type  of  system 
being  produced,  or  even  job  titles.  Ascent  Logic 
Corporation  has  estimated  there  were  300,000  people 
doing  system  design  in  the  United  States  in  1988  and  it 
is  expected  the  number  will  exceed  360,000  by  1996. 
This  is  partly  from  increased  employment  as  forecasted 
by  (Hearings  before  Joint  Economic  Committee  1988), 
but  also  because  more  workers  are  becoming  capable  of 
designing systems. Ascent Logic estimates that
approximately one-third to one-half of the designers
(100K to 150K) are focusing on CBSE. The worldwide
population is estimated to be approximately double
these figures.

ENGINEERING  COMPUTER-BASED  SYSTEMS 

Definition of CBS. A CBS consists of all components
necessary to capture, process, transfer, store, display,
and manage information. Figure 1 shows a CBS
reference model for a distributed system.* This is one
of a number of reference models under discussion
(Jackson, 1992; Alford, 1991; Oliver, 1991; White,
1991; White, 1987). In the figure, processing entities
include analog and digital hardware, firmware, and
software. Communications entities provide network
services that allow multiple processing entities to
exchange information, transparent to application
software. Information services provide for exchange of
information between processing entities and storage
devices, e.g., disks or tapes. Human/computer
interaction services including windows, graphics, and
command services support interaction between
processing entities and people. CBSs interact with the
physical environment through sensors and actuators,
and also interact with external CBSs.

CBSE. The nature of CBSs requires a different systems
engineering knowledge base than that normally required
to engineer non-CBSs. All CBSs involve application
software and associated services that are conceptual in
nature and inherently difficult to grasp. Requirements
satisfied by software are frequently ambiguous and
subject to change, leading to CBS design changes that
may sacrifice system architecture flexibility to ensure
performance requirements are met. Furthermore,
software changes in complex CBSs can result in
unpredictable behavior, both internal and external to the
CBS. The distributed nature of a CBS is unique in that
CBS resources are frequently geographically dispersed
and under the control of different organizations. To
exchange data among such systems requires interfaces
to describe content, and protocols to describe format.

A  dedicated  discipline  is  advocated  to  address  these 
complex  and  unique  CBS  attributes.  CBSE  as  a 
discipline  is  analogous  to  systems  engineering  in  the 
traditional  sense.  What  differs  is  the  focus,  and  hence 
the  skills  necessary  to  successfully  perform  CBSE. 
CBS engineering is concerned with the following
responsibilities:

• Design decisions concerning the distributed
nature of the CBS (its architecture),

• Allocation of resources to component developers
and management of the coordinated process,

• Allocation of functions and data to CBS
resources (processors, software, datastores, displays,
Human Computer Interface),

• CBS strategies with respect to safety, security,
and fault tolerance,

• Global system management strategy,

• Definition of information services,

• Performance allocations (timing, sizing,
availability),

• Testing (component, integration, interoperability
with the external environment),

• Logistics support (maintenance, training,
configuration management),

• Implementation of the CBS within the existing
environment.


*This figure is taken from (POSIX 1991).

Figure 1. Distributed CBS Reference Model

In performing these tasks, engineering tradeoffs
must be made, prompted by operational requirements,
limited resources (e.g., finances, personnel), CBS
component design (e.g., bandwidth, memory size, I/O
subsystem, database system), system environment
constraints (e.g., operational environment, security
measures), and performance thresholds (e.g.,
timeliness, throughput, availability).

CBSE  STATE  OF  PRACTICE 

This  section  addresses  standard  and  advanced 
practice,  and  topics  where  further  research  could  result 
in significant process improvement. The following key
areas are addressed:

•  The  CBSE  Process, 

• Requirements Definition,

•  Design  (Process  and  Architecture), 

•  Interfaces, 

•  Management, 

•  Process  automation, 

•  Documentation,  and 

•  Interpersonal  Communication. 

This  description  of  CBSE  state  of  practice 
addresses  many  of  the  most  important  issues,  but  is  not 
meant to cover all issues. The CBSE State of Practice
Working Group would like to hear from other CBSE
practitioners  concerning  additional  state  of  practice 
issues. 


CBSE Process. The CBSE process must be tightly
integrated with the systems engineering process.
Industry has recognized the size and scope of problems
with systems engineering processes, as evidenced by
the recently formed AIAA Systems Engineering
Working Group, the National Council on Systems
Engineering (NCOSE), and Europe's Atmosphere
Project. Recognition of the need for a special discipline
in CBSE is just emerging, as seen by the IEEE
Computer Society Task Force on Computer-Based
Systems Engineering.

Table 2 summarizes the state of the CBSE process.
Numbers in the table refer to paragraphs in the text.
The "1" in "1) Undefined CBSE process" indicates that
more information can be found in the first paragraph: 1)
Process definition.

Table 2. State of CBSE Process

STANDARD PRACTICE       ADVANCED PRACTICE           RESEARCH NEEDED
----------------------  --------------------------  --------------------------
1) Undefined CBSE       2) Systems engineering      1) Methods and tools for
   process                 process modeling (may       process modeling
                           include CBSE tasks)      2) CBSE roles and tasks

1) Process definition. In general, corporations
have not defined their CBSE processes. Some
companies use static modeling methods and tools to
capture and document their systems engineering
processes while establishing quantifiable process
metrics. A few companies use executable modeling
tools for process modeling; for the most part, these
tools were developed for requirements modeling and are
based on state machine or Petri net models. Industry
needs better methods and tools for process modeling.

2) Process modeling needs. Industry needs a
well-defined, executable CBSE process model that
incorporates relationships to the overall systems
engineering process. The defined CBSE process must
be flexible enough to foster process improvement in a
timely manner. Industry needs metrics to measure
process improvement and product quality.

3)  A  CBSE  discipline  is  needed.  CBSE  must 
start  at  the  beginning  of  the  systems  engineering 
process,  supporting  feasibility  analysis  and 
requirements  allocation.  Allocation  decisions  are  made 
early in the systems engineering life cycle, sometimes
before  proposal  submittal.  These  decisions  can  have 
significant  impact  on  critical  processing  flows, 
requiring  analysis  of  performance  and  accuracy. 

Requirements  Definition.  Table  3  summarizes  the 
state  of  practice  in  requirements  definition. 


Table 3. State of Requirements Definition

STANDARD PRACTICE       ADVANCED PRACTICE           RESEARCH NEEDED
----------------------  --------------------------  --------------------------
1) Natural language     1) Inconsistent languages   1) Consistent language
   for systems          2) Applying software        2) System modeling
   engineering             modeling techniques to   3) Model generated
                           systems engineering         scenarios

1)  Language.  Natural  language  is  the  standard 
practice  for  systems  engineering  specification.  Most 
software  engineers  use  software  requirements  models, 
frequently  modeling  data,  data  flow,  and  control  flow. 
Hardware  engineers  develop  models  using  VHDL(s). 
The  use  of  inconsistent  languages  by  different 
disciplines  leads  to  communication  and  traceability 
problems.  Information  passed  from  systems 
engineering  to  other  specialities  must  be  complete  and 
must  be  in  the  target  methodology  and  notation. 
Information  must  be  transferred  without  manual 
reentry.  Attention  needs  to  be  paid  to  both  the  semantic 
issues  and  the  tool  interface  issues  involved  in  this 
transition.  Changes  must  be  propagated  in  both 
directions. 

2) Methods and models. More advanced
practitioners are applying software engineering methods
to systems engineering. These methods are not
sufficient to support all CBSE functions. Practitioners
need effective methods for specifying system
performance requirements that support system design
and derivation of software performance requirements.
They also need useful paradigms that promote reuse of
existing requirements specifications, and use of
non-developmental items (NDI). Analysis for completeness,
consistency, and correctness is primitive. Tools should
apply logic, numerical analysis, and domain
understanding to the analysis problem.

3) Operational scenarios. Practitioners need
models for generating a wide range of operational
scenarios, including many with low probability. They
need these scenarios to determine whether the
requirements are consistent and adequate in terms of
defining and constraining the behavior of the system
within its environment.
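
As a rough illustration of model-generated scenarios, the Python sketch below enumerates every combination of a few environmental factors and ranks the combinations by probability, so that the low-probability cases are listed rather than overlooked. The factors, their outcomes, the probabilities, and the independence assumption are all hypothetical; a real operational model would be far richer.

```python
from itertools import product
from math import prod


def scenarios(factors):
    """Yield (outcomes, probability) for every combination of factor outcomes.

    `factors` maps factor name -> {outcome: probability}; factors are assumed
    independent, so a scenario's probability is the product of its parts.
    """
    names = list(factors)
    for combo in product(*(factors[n].items() for n in names)):
        outcomes = {n: outcome for n, (outcome, _) in zip(names, combo)}
        yield outcomes, prod(p for _, p in combo)


# Hypothetical operating environment.
factors = {
    "sensor": {"ok": 0.99, "failed": 0.01},
    "link": {"ok": 0.95, "degraded": 0.04, "down": 0.01},
    "load": {"normal": 0.9, "peak": 0.1},
}

all_cases = sorted(scenarios(factors), key=lambda case: case[1])  # rarest first
rare = [case for case in all_cases if case[1] < 1e-3]
```

Each rare scenario can then be checked against the requirements to see whether the required behavior is defined for it.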


Design  Process.  Table  4  summarizes  the  state  of  the 
CBSE  design  process. 


Table 4. State of CBSE Design Process

STANDARD PRACTICE       ADVANCED PRACTICE           RESEARCH NEEDED
----------------------  --------------------------  --------------------------
2) Static functional    2) Hard coded dynamic       1) Systems/software dialog
   models                  models                   2) Support for tradeoffs
                        3) Planning for change      4) Decision analysis
                        4) Starting decision        5) Reuse
                           capture

1)  System/software  engineering  dialog.  In  too 
many  cases,  systems  engineers  do  not  understand  what 
information  should  be  provided  to  CBS  implementers. 
Information is provided in the wrong sequence, and is
not analyzed to sufficient detail. The dialog with the
implementer is frequently not structured, e.g., providing
the  correct  level  of  information  to  determine  feasibility 
versus  the  correct  level  to  design. 

2) Static and dynamic models. Engineers, using
standard software practices, produce static functional
models that provide various representations for human
review and analysis. These engineers seldom use
dynamic modeling techniques such as Petri nets. They
frequently hard-code dynamic models of fixed-point
designs, and use few parameters to support
requirements changes and tradeoff analysis. They have
a limited ability to trade capability (data flow and logic
control) versus resource utilization and performance.
They have done little research on nonlinearities caused
by scale-up of capability/data, and seldom analyze or
model scale effects.

3) Planning for change. System developers are
beginning to use open system architectures, as
engineers plan for change. CBSs can change
significantly during their life cycle, and researchers
addressing an improved process must address this fact.
New processes must handle multiple variants for a
system ("system families") efficiently. Developers
must trace the implication of design changes to/from
higher level specifications and across families. To
support change analysis, CBSE needs effective design
decision capture.

4) Ramifications of systems engineering
decisions. Systems engineers make "high-level" or
"architectural" or "system-wide" design decisions.
These are policy decisions that should inform and
constrain subsequent design and management decisions
relating to various subsystems. It is unclear how to
present and propagate these key decisions, or how to
monitor subsystem design decisions to ensure that they
are not in conflict with system-level design decisions.
In addition, many high-level or system-level decisions
result in major consequences to the CBS. Engineers do
not understand these consequences when the decisions
are made. Research is needed to determine what types
of decisions have major consequences, what the
ramifications of these decisions are, and how such
decisions should be made.

5) Specification for Reuse. Industry needs
effective models of system classes and of common
subsystems/components that are used across classes.
These models are important for supporting domain
analysis and effectively storing high-level system
segment or subsystem descriptions in a library. To
effectively use a library, it is important to be able to
assert: "what I want is exactly like that except for ___."
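
The "exactly like that except for ___" assertion in item 5) can be illustrated with a small sketch in which a new specification is derived from a library entry by naming only the attributes that differ. This is one possible reading, not a proposed library design; the specification fields, the library entry, and the helper names are assumptions made for the example.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class SubsystemSpec:
    """Hypothetical high-level subsystem description stored in a library."""
    name: str
    update_rate_hz: float
    redundancy: int
    bus: str


library = {"nav_display": SubsystemSpec("nav_display", 30.0, 2, "bus_a")}


def like(library, key, **changes):
    """Derive a spec that is exactly like library[key] except for `changes`."""
    return replace(library[key], **changes)


def differences(base, derived):
    """List the attributes on which two specs differ, as (base, derived) pairs."""
    return {
        f.name: (getattr(base, f.name), getattr(derived, f.name))
        for f in fields(base)
        if getattr(base, f.name) != getattr(derived, f.name)
    }


variant = like(library, "nav_display", name="nav_display_v2", redundancy=3)
delta = differences(library["nav_display"], variant)
```

Recording only the delta keeps the derived description traceable to the library entry it came from, which also supports the system-family tracing discussed in item 3).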

Design Architecture. Table 5 summarizes the state of
CBSE  architecture. 


Table 5. State of CBSE Architecture

STANDARD PRACTICE       ADVANCED PRACTICE           RESEARCH NEEDED
----------------------  --------------------------  --------------------------
1) Performance at       2) Standard architectures,  2) Domain architectures
   expense of              open systems             3) Empirical data relating
   architecture                                        methods to quality
2) Component                                           designs
   hierarchies, few
   guidelines

1)  Performance  at  expense  of  architecture.  In 
standard  practice,  design  emphasis  is  on  system 
performance  at  the  expense  of  other  architecture  issues. 
These  performance  optimized  solutions  are  inflexible 
and  hard  to  adapt  to  changing  requirements. 

2) Component hierarchies, standard architectures.
Designers usually decompose unprecedented
systems into component hierarchies using few
guidelines. Standard domain architectures are
appearing as building blocks for precedented systems.


3)  Empirical  data  unavailable.  Designers  do  not 
sufficiently  use  partitioning  rules  associated  with 
maintainable  systems,  so  there  is  not  enough  analytical 
data  to  validate  these  rules. 

Interfaces. Table 6 summarizes the state of interface
definition. 


Table 6. State of CBSE Interface Definition

STANDARD PRACTICE       ADVANCED PRACTICE           RESEARCH NEEDED
----------------------  --------------------------  --------------------------
1) Documented by data   2) Environment and          1) Interface modeling
   element description     subsystem documentation  2) Capabilities for
   and protocol            and simulation              modeling host systems
                                                    3) HCI partitioning

1)  Level  of  abstraction  needed.  Engineers 
normally  document  interfaces  by  data  element 
description  and  protocol,  which  is  inadequate. 
Research  is  needed  to  define  the  level  of  abstraction 
that  supports  modeling  the  boundary  properties  of  a  set 
of  subsystems,  to  verify  that  their  combined  behavior 
matches  the  system  requirements.  If  this  can't  be  done, 
a  system  must  be  built  and  tested  before  it  is  known 
whether  it  meets  the  requirements. 

2)  Modeling  host  systems.  Engineers  need 
capabilities  for  modeling  host  systems,  to  predict  the 
effects  of  the  designed  system  on  the  host.  Host 
systems  are  frequently  human  activities  systems,  within 
which  our  designed  systems  are  to  be  used. 

3)  Human/computer  partitioning.  When 
engineers  design  systems,  they  implicitly  specify  the 
tasks  of  system  users.  Designers  need  better  knowledge 
of  how  to  partition  functions  and  responsibilities 
between  people  and  designed  systems.  Designers 
should  not  assign  functionality  to  a  system  just  because 
it is technically feasible, or even cost effective. They
need  better  human-computer  interface  (HCI)  models. 
They  usually  do  not  model  this  interface  adequately 
since  there  is  no  defined  process  for  integrating  the 
knowledge  of  all  stakeholders. 

Management. Table 7 summarizes the state of CBSE
management.

Table 7. State of CBSE Management

STANDARD PRACTICE       ADVANCED PRACTICE           RESEARCH NEEDED
----------------------  --------------------------  --------------------------
1) Good cost data for   2) Domain specific WBSs     4) Process and product
   precedented systems  3) Risk management             characterization for
                        4) Data collection             data collection

1)  Costing.  Different  cost  models  result  in 
different  cost  estimates,  using  gross  overall  metrics 
based  primarily  on  lines  of  code.  These  cost  models 
apportion  total  cost  and  effort  estimates  to  each  life- 
cycle  phase;  engineers  need  fine  grain  metrics  for  each 
phase.  Industry  can  usually  estimate  the  cost  of 
precedented  jobs  but  still  has  problems  costing 
unprecedented  jobs. 

2)  Work  Breakdown  Structure  (WBS).  Large 
projects  normally  track  cost  and  schedule  against  a 
WBS.  If  the  WBS  and  CBS  architecture  are 
inconsistent,  the  WBS  cannot  support  CBS  status 
tracking.  This  inconsistency  frequently  exists,  since 
engineers  define  the  WBS  before  they  know  the 
computer  system  architecture.  To  solve  this  problem, 
industry  needs  WBS(s)  that  support  each  CBS 
application  domain. 

3)  Risk  management.  Successful  projects 
perform risk management, but use primitive methods.
Generally  engineers  do  not  analyze  detailed  data  from 
previous  programs  to  understand  cause  and  effect. 
Some  risk  assessment  tools  exist,  but  they  do  not 
interact  with  requirements  and  design  tools,  where 
engineers  identify  risk  issues.  Industry  needs  general 
tool  suites  that  support  risk  management  views. 

4)  Data  collection.  In  good  practice,  engineers 
collect  data  to  manage  the  current  project.  In  best 
practice,  they  use  data  to  improve  engineering  processes 
and  make  predictions  for  future  projects.  Research  has 
provided  some  capabilities  for  collecting  software 
sizing  data  and  using  the  data  to  size  new  software 
components.  CBSE  needs  better  techniques  for  process 
and  product  characterization,  and  industry  must  collect 
and  have  access  to  the  data  it  knows  how  to 
characterize. 
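
The inconsistency between a WBS and the CBS architecture described in item 2) above can be made visible with a simple cross-check. The sketch below is only an illustration of that idea; the component names, the WBS element numbering, and the three kinds of gap it reports are assumptions, not an established tracking method.

```python
def wbs_gaps(architecture, wbs):
    """Compare architecture components with the components each WBS element tracks.

    architecture: set of component names.
    wbs: mapping of WBS element -> set of component names tracked under it.
    Returns (untracked, unknown, split):
      untracked - components no WBS element covers,
      unknown   - tracked names that are not in the architecture,
      split     - components tracked under more than one WBS element.
    """
    tracked = set().union(*wbs.values())
    untracked = architecture - tracked
    unknown = tracked - architecture
    split = {}
    for component in sorted(architecture):
        owners = [element for element in sorted(wbs) if component in wbs[element]]
        if len(owners) > 1:
            split[component] = owners
    return untracked, unknown, split


# Hypothetical project data.
architecture = {"signal_processor", "data_store", "operator_console", "network"}
wbs = {
    "1.1 Software": {"signal_processor", "operator_console"},
    "1.2 Hardware": {"signal_processor", "display_unit"},
    "1.3 Comms": {"network"},
}
untracked, unknown, split = wbs_gaps(architecture, wbs)
```

An empty result in all three categories would indicate that cost and schedule status reported against the WBS can be read directly as CBS status.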

Process  Automation.  Table  8  summarizes  the  state  of 
CBSE process automation.

1) Levels of tool support. There are several
layers of CBSE tool support. At the lowest functional
level, automated tools accomplish a specific task in a
particular phase of the system life cycle. An automated
requirements capture tool is an example of such a tool.
At the next level, tools help the systems designer
accomplish many tasks across many phases of the life
cycle. A requirements tracing tool provides this level of
support. Further, there are tools and techniques
designed to support management and integral processes
across all system life-cycle phases. Examples of these
tools are problem tracking and reporting tools and
project tracking and reporting tools. At the very highest
level, there is a class of tool that provides a
"framework" into which individual tools are deployed
appropriately over the life cycle, and that maintains all
records and documentation for system development and
operation. Analogous concepts are found in CAD with
the "CAD framework initiative" and in software with
"Integrated Project Support Environments."


Table 8. State of CBSE Process Automation

STANDARD PRACTICE       ADVANCED PRACTICE           RESEARCH NEEDED
----------------------  --------------------------  --------------------------
1) Task oriented, some  1) Frameworks, approach     1) Integrated Systems/CBSE
   tool interfaces,        to repetitive tasks         environment
   software                                         2) Efficient change mgmt
   environment

2)  Higher  level  tools  needed.  Unfortunately, 
there  are  few  examples  of  sophisticated,  high-level 
process  automation  tools  that  systems  engineers  widely 
use. A recognition of the need exists, but tool suites
have  not  kept  pace  with  this  recognition.  An  area 
where  such  tools  are  needed  is  in  change  management. 
Efficient  change  entry  is  a  major  problem  for  large 
complex  systems.  Major  changes  may  require  updating 
information  about  a  single  entity  in  several  places  in 
databases  of  a  number  of  tools.  This  multiple  manual 
update  process  is  both  costly  and  error  prone.  A 
standard  model/schema  underlying  all  tools  in  use  is  a 
desirable  solution. 
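
The idea of a standard model underlying all tools can be sketched as a single repository through which every tool reads and writes, so that a change to an entity is entered once. The classes below are a hypothetical illustration, not a description of any existing framework; the entity identifier and attribute names are invented for the example.

```python
class Repository:
    """One record per entity; every tool reads and writes through this store."""

    def __init__(self):
        self._entities = {}
        self._subscribers = []

    def subscribe(self, callback):
        """Register a callback invoked as callback(entity_id, record) on change."""
        self._subscribers.append(callback)

    def put(self, entity_id, **attributes):
        """Create or update an entity once, then notify every subscribed tool."""
        record = self._entities.setdefault(entity_id, {})
        record.update(attributes)
        for notify in self._subscribers:
            notify(entity_id, dict(record))

    def get(self, entity_id):
        """Return a copy of the current record for the entity."""
        return dict(self._entities[entity_id])


class ToolView:
    """A tool that keeps no private copy, only a log of change notifications."""

    def __init__(self, name, repo):
        self.name = name
        self.repo = repo
        self.seen = []
        repo.subscribe(lambda entity_id, record: self.seen.append(entity_id))

    def show(self, entity_id):
        return self.repo.get(entity_id)


repo = Repository()
requirements_tool = ToolView("requirements", repo)
design_tool = ToolView("design", repo)

repo.put("REQ-12", text="Update display at 30 Hz", status="draft")
repo.put("REQ-12", status="approved")  # single change entry, seen by both tools
```

Because neither tool holds its own copy of the entity, the multiple manual update described above does not arise in this arrangement.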

Documentation.  Table  9  summarizes  the  state  of 
CBSE  documentation. 


Table 9. State of CBSE Documentation

STANDARD PRACTICE       ADVANCED PRACTICE           RESEARCH NEEDED
----------------------  --------------------------  --------------------------
1) Natural language     1) Databases and            1) Integral role with
   specs                   documentation               process
                           generators               2) Designed approach

1) Integral role with process. Industry tends to
focus attention on those documents that are delivered to
the customer, but each and every artifact of the process
is an element of system documentation. Engineering
drawings, test reports, software source code,
requirements tracings, defect tracking reports, and many
other such "documents" are a part of what we do. To
date, industry has focused little attention on the integral
role of capturing all these artifacts as part of the CBSE
process. Significant change will occur only when
industry integrates methods and tools to support
improved processes throughout the CBSE life cycle.

2) Designed and automated approach. In part,
the weak role that process automation has played in
CBSE is to blame. Without necessary databases and
automated tools, control and maintenance of
documentation is tedious, repetitive, and prone to
human error. The introduction of general purpose tools
such as word processors and electronic spreadsheets has
contributed to improved performance in this area. A
"through-designed" approach is needed into which these
general purpose tools would be fitted.

Interpersonal  Communication.  Table  10  summarizes 
the  state  of  interpersonal  communication. 


Table 10. State of Interpersonal Communication

STANDARD PRACTICE       ADVANCED PRACTICE           RESEARCH NEEDED
----------------------  --------------------------  --------------------------
1) Diverse team: Same   1) Concurrent engineering   2) Defined roles
   syntax/different     4) Training                 3) CBSE as integral part
   semantics                                           of systems engineering

1)  Diverse  team.  Successful  application  of  the 
systems  engineering  process  hinges  on  the  abilities  of  a 
diverse  team  of  specialists  to  communicate  with  a 
common  viewpoint.  This  is  especially  true  when  the 
product  is  a  CBS,  as  diversity  of  backgrounds  adds 
complexity  to  the  communication  process  itself. 

2)  Undefined  process.  Software  engineers,  as 
detail  design  engineers,  perceive  that  their  interface 
with  systems  engineers  is  poorly  defined.  The  problem 
lies  in  a  lack  of  precise  definition  of  the  allocation 
process  and  the  failure  to  trace  software  specifications 
to  the  top-level  functionality  of  the  system. 

3)  Component  view.  Another  cause  of  difficulty 
is  the  view  that  software,  computer  hardware,  and 
communications  assets  are  component  pieces  of  the 
systems  engineering  discipline  rather  than  a  whole. 
Systems  engineers  find  it  difficult  to  break  out  of  this 
partitioning  paradigm,  and  software  engineers  are  not 
apprised  of  the  tradeoffs  and  design  decisions. 

4) Cross-training needed. Academic and industry
in-house training programs for systems and
software engineering must take a more interdisciplinary
view, share some common courses, and work toward
the development of a common set of semantics.


ACHIEVABLE  SYSTEMS  ENGINEERING 
IMPROVEMENTS 

Developing and using complex systems requires
major capital investment. Industry must make investments
in large systems wisely, with thorough
consideration of what the system is to do, the rewards
for doing it, and an investigation of feasibility. The
systems  engineer  must  ensure  that  he/she  has 
thoroughly  captured  user  needs  and  defined  a  feasible 
system  design  that  meets  all  system  requirements  and 
constraints. 

Industry must act expeditiously to improve systems
engineering by advancing CBSE practice. Corporations
can  implement  each  of  the  following  suggestions  today. 

CBS  Modeling.  Various  modeling  approaches 
augment  text  specification  with  semantically  precise 
representations  for  engineering  information.  There  is  a 
gap  between  current  text-based  practice  and  future 
model-based  practice  that  must  be  closed  quickly  and 
economically.  Closing  the  gap  means  an  extensive 
culture  change  and  substantial  retraining. 

Retraining  efforts  must  target  methods  and 
notations  that  engineers  can  readily  learn,  as  retraining 
costs  usually  dominate  transition  costs  when 
implementing  new  methods.  If  at  all  possible,  new 
notations  and  methods  should  evolve  gracefully  from 
notations,  methods,  and  concepts  currently  in  use. 

A  Defined  Process.  The  engineering  process  includes 
a  sequence  of  process  steps,  and  policies  for  process 
control,  documentation,  and  staffing.  Industry  should 
define  the  process  in  layers  to  separate  engineering 
steps from alternative methods of control, documentation
standards, and staffing choices. Layers of
description  may  include: 

1.  Description  of  process  steps. 

2.  Description  of  the  control  process  and 
organizational  groups  or  boards  that  perform  control. 

3.  Description  of  representations  used  for 
information  captured  at  each  step  of  the  process. 

4.  A  mapping  of  each  step  and  representation  onto 
tools  that  automate  that  step. 

5. Description of the staffing of engineering
alternatives.

6. Description of the review process and
information used for review.

Industry  should  base  the  systems  engineering 
process  on  a  set  of  models  that  define  engineering 
information  in  a  way  that  computers  can  capture  for 
interpretation,  execution,  consistency,  and  correctness. 
The process should support one-time entry of information,
both for new designs and for changes.

Dynamic Analysis and Simulation. Industry should
capture the behavior of systems in a representation that
can be executed dynamically for analysis. This ability
gives rigor to the system description. Developers
should apply the same representation to scenarios that
describe how the system interacts with its environment.
In addition, industry should use simulation to prove
feasibility and develop benchmarks.

Performance  Requirements  Management. 
Traditionally,  engineers  budget  performance 
requirements  to  components  as  they  decompose  the 
system.  They  budget  quantities  such  as  weight,  power, 
heat production, heat dissipation, reliability, time
response, and memory size from a parent component to
subcomponents.  They  should  use  methods  and  tools  to 
track  these  budgets,  and  should  compare  design, 
simulation,  and  test  results  to  budgeted  specifications. 
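The budgeting practice described above can be sketched in a few lines of Python. The fragment is illustrative only: the class `Component`, its fields, and the example figures are hypothetical and are not taken from any tool discussed in this paper. It checks that the budgets handed to subcomponents do not exceed the parent's budget, and it compares a measured value (from design, simulation, or test) with the budgeted one.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Component:
    """A node in a hypothetical system decomposition with one budgeted quantity."""
    name: str
    budget: float                      # e.g. watts, kilograms, or milliseconds
    measured: Optional[float] = None   # design, simulation, or test result
    children: List["Component"] = field(default_factory=list)


def over_allocated(parent: Component) -> List[str]:
    """Names of components whose children were budgeted more than the parent allows."""
    problems = []
    if parent.children and sum(c.budget for c in parent.children) > parent.budget:
        problems.append(parent.name)
    for child in parent.children:
        problems.extend(over_allocated(child))
    return problems


def over_budget(node: Component) -> List[str]:
    """Names of components whose measured value exceeds their budget."""
    problems = []
    if node.measured is not None and node.measured > node.budget:
        problems.append(node.name)
    for child in node.children:
        problems.extend(over_budget(child))
    return problems


# Hypothetical power budget in watts.
radar = Component("radar", 100.0, children=[
    Component("antenna", 60.0, measured=55.0),
    Component("processor", 40.0, measured=46.0),
])
```

Calling `over_allocated(radar)` and `over_budget(radar)` on this example separates allocation errors from components that fail to meet their allocation, which are the two comparisons the paragraph above asks engineers to track.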

Metrics,  Costing,  and  Tracking.  Metrics,  costing,  and 
tracking  are  essential  both  for  short-term  decisions  and 
for long-term continuous process improvement. Well-defined
and broadly accepted metrics are required to
quantify the work. Industry needs cost models to transform
these metrics into numbers for particular
applications.  Corporations  must  track  work,  both  to 
control  activities  and  to  calibrate  cost  models. 

Scalability.  Closing  the  gap  between  current  practice 
and  mature  practice  involves  serious  issues  of  proof  of 
scalability.  Engineers  must  try  new  methods  and  tools 
on  modest-sized  systems,  so  that  method  deficiencies 
can  be  eliminated  and  unanticipated  benefits  can  be 
incorporated. 

Major  Risk  Factors.  Investment  choices  are  difficult 
in  these  times  of  strong  competition.  Organizations 
must  make  decisions  at  a  time  when  there  is  no  solid 
baseline  of  methods,  metrics,  and  cost  for  the  systems 
engineering process. They can estimate anticipated cost
improvement,  but  cannot  derive  it  from  existing  data. 

CONCLUSIONS 

Industry needs a CBSE discipline to integrate
component views during specification and design of
critical end-to-end processing flows. Such a discipline
would encompass knowledge of the systems
engineering process, software, digital and analog
processing hardware, and communications. The
combined discipline is necessary for making proper
CBS design tradeoffs, decisions, and allocations. Many
engineers are performing CBSE tasks, but there are no
defined positions, tasks, and roles. The discipline is
necessary for process improvement, and for fostering
research and training.


Abstract

This paper describes the DESTINATION Interface Specification (DIS). Design Structuring and
Allocation Optimization (DESTINATION) is an ongoing research project at the Naval Surface
Warfare Center (NSWC) to provide a new methodology for design optimization and trade-off
analysis  of  real  time  systems.  The  need  for  DIS  arises  from  the  inherent  adaptiveness  of  the 
DESTINATION  system  to  a  wide  range  of  source  and  target  tools.  DIS  not  only  allows 
DESTINATION  to  coexist  with  various  systems  but  also  dictates  standards  for  a  comprehensive 
way  of  capturing  design  information.  The  basic  structure  of  DIS  reflects  a  method  of 
extracting/incorporating  design  information  that  is  otherwise  not  available  across  a  collection  of 
tools.  DIS  accommodates  the  identification  of  additional  design  information,  allowing  for 
customization  of  the  source  and  target  tools.  The  focus  of  the  DIS  research  and  development 
work  is  currently  in  the  area  of  system  logical  modelling  and  implementation  modelling. 

1.  Introduction 

One  of  the  primary  thrusts  behind  the  Systems  Design  Synthesis  project  of  the  NSWC’s 
Engineering  of  Complex  Systems  Technology  Block  Research  Program  (ECS)  is  to  provide  a 
new  methodology  for  systems  engineers  in  the  area  of  Design  Optimization  and  Trade-Off 
Analysis.  Systems  engineers  require  such  a  new  methodology  to  cost  effectively  construct  and 
maintain  increasingly  complex  mission-critical,  real-time  systems. 

Application  complexity  has  increased  not  only  due  to  functional  demands,  but  also  because  of 
technological  advances.  The  present  and  future  combat  systems  must  respond  to  an  expanding 
theater  of  commands,  as  well  as  the  requirement  to  perform  in  an  integrated  manner. 
Technologically,  the  advent  of  parallel  computers  and  high  speed  networks  opens  many 
opportunities  to  provide  greater  defense  capabilities.  These  functional  and  technical  factors 
greatly  increase  the  design  space  that  the  systems  engineer  must  explore  in  search  of  a  design 
that  satisfies  all  requirements.  The  idea  behind  design  optimization  and  trade-off  analysis  is  to 
provide  the  systems  engineer  with  the  necessary  tools  and  techniques  to  systematically  evaluate 
and  exploit  the  vast  design  space. 

DESTINATION  is  the  name  given  to  the  NSWC  research  effort  that  focuses  on  developing  the 
necessary  tools  and  techniques  to  support  such  a  methodology  for  design  optimization  and 
trade-off analysis [HoNH]. The emphasis of this project is on design structuring and resource
allocation tools and techniques. Design structuring involves making decisions regarding
decomposition/recomposition and fragmentation/defragmentation of hierarchical designs.
Resource allocation includes the mapping of logical design objects onto implementation
resources in a near-optimal manner.

The need for an interface specification to perform design optimization first arose on a
predecessor project to DESTINATION called EDA, or Expert Design Advisor [HoHN]. The first
version of the interface specification, developed for EDA, was used to standardize the format of
inputs for the development of four resource allocation optimization algorithms: Gaiiu [Fcai],
Data-Oriented, Genetic [David], [Gold], and Simulated Annealing [KKiV]. Reuse of the same
data structures reduces the size of the development effort and increases the ability to adapt to
new algorithms. The need for the exchange of information between various front-end CASE tools
has initiated the development work on standards for the interface specification that would
enhance portability, adaptability, maintainability and extensibility for a wide range of
source/target tools.


To better understand the function and structure of the DESTINATION Interface Specification, it
is necessary to further explain the associated DESTINATION methodology. By reviewing the
context diagram in Figure 1, we can see the DESTINATION methodology's scope and
contribution within the systems life cycle.


Figure 1: DESTINATION Context Diagram.


The human systems engineer plays a critical role within the methodology. The systems
engineer’s input is fundamental for selecting the subject of design optimization and evaluation,
applying analysis techniques and interpreting results. The methodology supports the systems
engineer by making characterizations and recommendations. The systems engineer may accept
or override these outputs.


A complete scenario of steps can be mapped into the context diagram to describe the
methodology.

1. The systems engineer gives a directive to select a design capture view for analysis.

2. DESTINATION interacts with the systems engineer to determine the following:

a. System characterization.

b. Formulation of design goals.

c. Application of design rules.

d. Recommendation for use of simulation/optimization tools and techniques.

3. The systems engineer directs DESTINATION to import the necessary data for the
selected simulation/optimization tools and techniques.


4. Results from simulation/optimization are exported to DESTINATION for evaluation.

5. When the systems engineer is satisfied with the achievement of design goals, any
modifications to the design capture views are imported to the Design Capture system to
maintain consistency.

Although these steps are listed sequentially, it is expected that there will be a high
degree of iteration within and among these steps, particularly when performing trade-off
analysis.

Many approaches have been followed that specify what to capture and how to capture it, such as
structured analysis and design [WaMe] and object-oriented design [BOOC]. One of the most
robust approaches to be defined, which is consistent with earlier DESTINATION research
efforts, was jointly developed by NAVSWC and Trident Systems Corporation ([Karc], [Hoan]).
This approach to forward design capture represents one set of information that may be
incorporated into the DESTINATION methodology for analysis. Though DESTINATION is not
restricted to any particular Design Capture approach, the NAVSWC/Trident approach is one of
the most robust and places the strongest demands on DESTINATION, so it is advantageous to
use from a research perspective. Furthermore, use of this approach ensures integrated results
within the System Design Synthesis project of the ECS block program.

Basically,  the  forward  design  capture  accepts  design  information  according  to  three  models:  the 
conceptual  model,  the  logical  model,  and  the  implementation  model.  Each  is  described  below. 

The conceptual model captures the operational ideas of the system from the perspective of the
operational environment and information modelling. The environmental view establishes the
conditions and environment in which the system must operate, including a description of the
system architecture's scope and boundaries, test plan, and operational scenarios. The conceptual
model allows the system engineering team and the customer to form a clear understanding of the
subject system.

The  logical  model  includes  a  description  of  the  functional  and  behavioral  views  of  the  system, 
without  regard  for  any  particular  implementation  decisions.  The  emphasis  within  this  design 
capture  model  is  on  what  the  system  should  do  as  opposed  to  how  it  should  do  it.  The 
behavioral  view  provides  an  understanding  of  the  system  from  a  dynamic  perspective. 

The  implementation  model  documents  the  hardware,  software  and  human  resources  which 
represent  a  particular  embodiment  of  the  system  under  design.  The  hardware  architecture 
describes  the  physical  resources  of  the  system  including  the  components,  interconnection 
topology  and  protocol,  and  rationale  for  selection.  The  software  architecture  describes  the 
Computer Software Configuration Items (CSCIs) and the executable software tasks including the
messages  passed  between  tasks.  The  human  resource  description  includes  the  number  of 
personnel  required  to  operate  the  system  under  various  conditions  and  the  level  of  training  and 
experience  for  each  operator. 

There  is  no  restriction  on  what  design  methodologies  may  be  used  within  the  development  of 
any  of  the  three  models. 

Likewise, any number of simulation/optimization tools and techniques are available for use
within DESTINATION. Optimization algorithms that may be applicable for use include
computation/communication-oriented, genetic search and simulated annealing. Simulation
techniques that may be interwoven with the optimization algorithms include Petri-net simulation
(e.g. SES/workbench, ADAS), queueing theory [CCCC], and general purpose simulation
languages (e.g. Simscript). Future advances in both optimization and simulation are to be
expected.
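To make the idea of an optimization algorithm for resource allocation concrete, the following Python sketch applies a textbook form of simulated annealing to a toy problem: assigning tasks to identical processors so that the most heavily loaded processor finishes as early as possible. It is not the algorithm used in DESTINATION or EDA. The cost function, the cooling schedule, and all names (`makespan`, `anneal`, `t0`, `cooling`) are assumptions chosen for illustration.

```python
import math
import random
from typing import List, Tuple


def makespan(assignment: List[int], durations: List[float], n_proc: int) -> float:
    """Largest total duration placed on any one processor."""
    loads = [0.0] * n_proc
    for task, proc in enumerate(assignment):
        loads[proc] += durations[task]
    return max(loads)


def anneal(durations: List[float], n_proc: int, steps: int = 5000,
           t0: float = 10.0, cooling: float = 0.999,
           seed: int = 0) -> Tuple[List[int], float]:
    """Search for a task-to-processor assignment with a small makespan."""
    rng = random.Random(seed)            # seeded so that runs are repeatable
    current = [rng.randrange(n_proc) for _ in durations]
    cost = makespan(current, durations, n_proc)
    best, best_cost = current[:], cost
    temp = t0
    for _ in range(steps):
        task = rng.randrange(len(durations))
        old_proc = current[task]
        current[task] = rng.randrange(n_proc)      # propose moving one task
        new_cost = makespan(current, durations, n_proc)
        # Always accept improvements; accept a worse move with a probability
        # that shrinks as the temperature falls.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = current[:], cost
        else:
            current[task] = old_proc               # reject the move
        temp *= cooling
    return best, best_cost


durations = [4.0, 3.0, 3.0, 2.0, 2.0, 2.0]   # hypothetical task durations
best, best_cost = anneal(durations, n_proc=2)
```

The result is a near-optimal rather than a guaranteed optimal assignment. The occasional acceptance of worse moves is what lets the search leave local minima.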

2.  Requirements  of  the  DESTINATION  Interface  Specification  (DIS) 

As described in the previous section, the DESTINATION Interface Specification is the layer of
data  structures  and  export/import  routines  that  permit  application  information  to  flow  in  and  out 
of  DESTINATION.  From  one  perspective,  DIS  bridges  the  design  capture  facilities  with 
optimization  decisions  and  from  another  perspective,  DIS  integrates  the  execution  of  simulation 
models  and  optimization  algorithms  with  design  evaluation  and  recommendations.  This  iterative 
path  of  capturing,  modelling,  evaluating  and  recommending  becomes  significantly  more 
streamlined  by  having  a  consistent  and  robust  medium  of  exchange. 

A  number  of  requirements  impacted  the  development  of  DIS.  These  requirements  can  also  be 
viewed  as  motivating  factors  for  making  the  investment  in  DIS. 

1.  Tool  Independence 

It  is  desirable  to  have  a  methodology  be  independent  of  a  particular  toolset.  There  may 
be  financial  and  training  constraints  that  oppose  acquiring  a  toolset,  particularly  when 
a  comparable  one  may  already  be  in  place.  In  the  context  of  design  capture,  for  instance, 
there  are  several  Front-End  Computer-Aided  Software  Engineering  (FE-CASE)  tools  that 
are operational within Department of Defense (DOD) programs, most notably Cadre’s
Teamwork and IDE’s Software Through Pictures (StP). The interface to DESTINATION
should handle data from Teamwork just as easily as data from StP.

Certainly,  as  part  of  the  access  routines  into  the  FE-CASE  system’s  repository,  there  will  be 
some  effort  that  is  not  reusable.  The  goal  is  to  minimize  this  effort.  Figure  2  shows  how 
this is done in the context of interfacing with Cadre’s Teamwork.


Figure 2: Design Capture Interface.

Two capabilities are shown in Figure 2. The first capability allows the systems engineer
to supplement the design capture process with information for design
optimization. Three types of design optimization information (DOI) have been
included as part of the design capture supplement: System Design Factors [HoNH], data
required by optimization algorithms, and hardware resource descriptions and characteristics.
The second capability extracts information stored directly in the Front-End CASE system’s
repository representing the information contained within the CASE graphics (bubbles, flows,
connections, etc.). The two information sources, the Front-End CASE dependent data and the
Front-End CASE independent DOI, are then merged to create a DIS compatible file.

As in the FE-CASE area, a similar, though possibly more complicated, choice of tools for which
to develop interfaces exists within the Simulation System/Optimization Technique domain.

2.  Implementation  Independence 

The DESTINATION toolset contains many interconnected subsystems, such as those for design
characterization, design evaluation, and making recommendations regarding design structuring
and resource allocation. Development of these subsystems can proceed more independently
by sharing the DIS among them.

Furthermore,  developers  of  algorithms  for  resource  allocation,  scheduling,  and  design 
structuring  can  utilize  the  DIS  as  a  departure  point  for  their  innovation.  DESTINATION 
then  provides  a  convenient  proving  ground  for  determining  the  situations  where  the 
algorithm  performs  best.  This  makes  for  a  win-win  situation  for  DESTINATION  and 
algorithm  developers:  there  will  be  an  increased  likelihood  that  their  algorithms  will  be 
transferred to practical use and likewise DESTINATION’s library of algorithms on which it
bases  its  optimization  recommendations  will  progressively  expand.  It  is  expected  that  the 
algorithm  developers  will,  in  turn,  uncover  additional  requirements  for  the  DIS  and  through 
feedback  DIS  will  progressively  improve. 

3.  Supports  Incomplete  Information 

If optimization is to be performed on a bottom-up basis, substantial information
may not have been provided at a higher level. To proceed under these circumstances,
default values can be associated with the DIS data structures and be supplied
as required by the optimization algorithms.
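A minimal Python sketch of this default-value mechanism follows. The field names echo the timing information mentioned later in this paper (ready time, deadline, duration). The particular default values, the `DEFAULTS` table, and the `TaskTiming` class are assumptions made for illustration, since the paper does not list the actual defaults.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical defaults; the actual DIS defaults are not given in the paper.
DEFAULTS = {"ready_time": 0.0, "deadline": float("inf"), "duration": 1.0}


@dataclass
class TaskTiming:
    """Timing data in which any field may still be unspecified (None)."""
    ready_time: Optional[float] = None
    deadline: Optional[float] = None
    duration: Optional[float] = None

    def value(self, field_name: str) -> float:
        """Return the captured value, or the default when nothing was captured."""
        given = getattr(self, field_name)
        return DEFAULTS[field_name] if given is None else given


# Only the duration has been captured so far.
partial = TaskTiming(duration=2.5)
```

Keeping "not yet specified" distinct from a real value lets an optimization algorithm proceed on incomplete designs while the data structure still records which entries the systems engineer has yet to supply.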

4.  Gain  Wide  Acceptance 

DIS  must  be  designed  so  that  it  can  be  used  widely,  as  described  in  Figure  1,  by  systems 
engineers,  algorithm  developers,  tool  vendors  and  standards  bodies. 

5.  Transportability 

DIS should facilitate the transportability of design capture, design optimization and simulation
information from one computer environment to another.

6.  Uniformity  and  Cohesiveness 

The DIS model should be simple and uniform, while minimizing the number of concepts,
types and classes of operations.

7.  Implementability 

Vendors  of  simulation  systems,  front-end  CASE  systems,  and  algorithm  developers  should 
be  able  to  utilize  DIS  with  only  a  reasonable  effort.  The  design  of  DIS  should  allow 
for  flexibility  in  implementation  while  maintaining  consistent  operational  semantics. 

8.  Extensibility 

As  mentioned  above,  tool  vendors,  algorithm  developers  and  systems  engineers  will  uncover 
additional  requirements  on  DIS.  DIS  should  not  preclude  any  extensions  to  its  scope 
to  satisfy  evolving  needs. 

9.  Performance 

The DIS design must allow for efficient operation, both for external access by design
capture and simulation systems and for internal DESTINATION procedures.

To satisfy these requirements, several basic design decisions for DIS were made.

1. Represent the data structures for easy mapping onto a flat ASCII file. This accommodates
the requirements for acceptability, transportability, implementability, extensibility, and
performance.

2.  Utilize  Ada  as  the  language  for  formally  specifying  DIS.  There  were  several  underlying 
reasons  for  this: 

a.  Ada  is  a  DOD  standard. 

b.  Ada  is  widely  available  on  many  computing  platforms. 

c.  Ada’s  package  facilities  and  specification/body  separation  could  be  used  to  express 
multiple  layers  of  abstraction. 

d. Ada was very successful in its use as a specification language for the Ada Semantic
Interface Specification (ASIS) definition [BlSp]. ASIS is a vendor-independent,
non-proprietary bridge between Ada libraries and Ada tools.

There are a number of alternative methods for specifying the interface. English was
dismissed as being too ambiguous. Use of formalisms like the Backus Naur Form (BNF)
and Extended Backus Naur Form (EBNF) has the advantage of concise accuracy, allowing
little room for ambiguity and vagueness, but does not allow high level representation. C
was  not  used  because  of  its  lack  of  abstraction  facilities.  C++  and  Common  Lisp  Object 
System  (CLOS)  are  viable  alternatives,  particularly  due  to  their  strong  object  orientation  and 
inheritance  facilities.  Presently,  they  lack  the  standardization  and  DOD  acceptance  of  Ada. 
There  are  systems  and  associated  languages  that  specialize  in  interface  definition  and 
actually  automatically  generate  some  code  necessary  for  declaring  and  accessing  the 
interface  [NEST].  These  facilities  warrant  further  investigation,  but  are  not  DOD  standards. 


There  are  a  number  of  potentially  useful  integration  standards  that  are  emerging,  such  as 
PCIS,  Case  Integration  Services  (CIS),  IRDS,  CASE  Document  Interface  Format  (CDIF), 
IEEE-P1175 and NGCR’s PSSWG [StSh]. None of these efforts, however, is directly
working in the area of design optimization, and all lack representation of much needed
information. Through planned participation with these working groups and standards bodies,
DIS should beneficially impact these efforts.
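The first design decision above, mapping the data structures onto a flat ASCII file, can be illustrated with a short Python sketch. The line layout used here (one record per line, fields separated by `|`, each field written as `name=value`, and pointers replaced by identifiers) is an assumption made for this example. The paper does not define the actual DIS file layout.

```python
from typing import Dict, List, Tuple

# A record is a kind (e.g. "task_node") plus named string fields.
Record = Tuple[str, Dict[str, str]]


def dump(records: List[Record]) -> str:
    """Write each record as one line: kind|name=value|name=value ..."""
    lines = []
    for kind, fields in records:
        parts = [kind] + [f"{name}={value}" for name, value in fields.items()]
        lines.append("|".join(parts))
    return "\n".join(lines) + "\n"


def load(text: str) -> List[Record]:
    """Rebuild the records from the flat text produced by dump()."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue                      # ignore blank lines
        kind, *parts = line.split("|")
        fields = dict(part.split("=", 1) for part in parts)
        records.append((kind, fields))
    return records


# Pointers between structures become identifiers in the flat form.
records = [
    ("task_node", {"id": "T1", "duration": "5"}),
    ("task_node", {"id": "T2", "duration": "3"}),
    ("task_edge", {"from": "T1", "to": "T2"}),
]
flat = dump(records)
```

This sketch assumes that names and values never contain `|` or a line break. A plain text form of this kind can be moved between computing environments and extended with new record kinds without changing the reader.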

3.  Components  of  the  Interface  Specification 
3.1.  Overview 

The current version of DIS, 2.0, is divided into several packages at its top level. Each of these
packages comprises lower level packages. This basic structure, at present, is shown in
Figure 3.


Figure 3: DESTINATION Interface Specification.


The following section describes the implementation model in greater detail with brief
descriptions of the structural components and the Ada data type declarations for some of the
major components. This is where the emphasis of the current effort has been. Development of a
technical report describing the other packages is in progress.

3.2.  Implementation  Model 

The  implementation  model  contains  data  representations  for  making  and  analyzing  decisions  for 
resource allocation and design structuring. Representations are needed for the following
information: 

1.  A  task  graph  to  depict  the  candidate  software  configurations. 

2.  A  resource  graph  to  depict  the  candidate  hardware  configurations. 

3. Constraints derived from requirements. Constraints are presently divided into two categories:
placement constraints and timing constraints. These constraints impact the effectiveness
of optimization. Each of these types of constraints contains several sub-types
of constraints that will be described further below.

The  implementation  model  of  a  real-time  system  consists  of  one  or  more  implementation  views 
(i.e.,  a  list  of  implementation  views).  The  Ada  type  declaration  for  this  information  in  the  DIS  is 
shown  in  Figure  4.  Each  implementation  view  consists  of  a  software  structure  diagram,  a 
hardware  structure  diagram  and  a  list  of  mappings  of  software  components  to  hardware 
resources. 


type DIS_implementation_view_type is
   record
      DIS_sw_structure_diagram         : DIS_sw_structure_diagram_ptr;
      DIS_hw_structure_diagram         : DIS_hw_structure_diagram_ptr;
      DIS_implementation_mapping_list  : DIS_mapping_view_ptr;
      DIS_implementation_view_next     : DIS_implementation_view_ptr;
      DIS_implementation_view_previous : DIS_implementation_view_ptr;
   end record;

Figure 4: Implementation View Components.

The term structure diagram is used to reference a collection of directed graphs, drawn with
respect  to  a  selected  methodology,  that  captures  information  about  a  set  of  components  and  their 
relations  along  with  any  hierarchical  decomposition.  For  example,  a  tree  of  data  flow  diagrams 
may  be  considered  as  one  type  of  structure  diagram. 

The list of mappings is provided because, for the same software and hardware structure diagrams,
we may apply different allocation tools to determine possible mappings. Each of the
components of the implementation view is further described below.
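The reason for keeping a list of mappings can be shown with a small Python sketch. The class `ImplementationView`, its fields, and the task and processor names are hypothetical. The sketch only mirrors the idea that one pair of software and hardware structures may carry several candidate allocations, each perhaps produced by a different tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ImplementationView:
    """One software structure, one hardware structure, many candidate mappings."""
    software_tasks: List[str]
    hardware_resources: List[str]
    # Each mapping assigns software tasks to hardware resources.
    mappings: List[Dict[str, str]] = field(default_factory=list)

    def add_mapping(self, mapping: Dict[str, str]) -> None:
        """Store a candidate mapping after checking it stays within this view."""
        unknown_tasks = set(mapping) - set(self.software_tasks)
        unknown_hw = set(mapping.values()) - set(self.hardware_resources)
        if unknown_tasks or unknown_hw:
            raise ValueError("mapping refers to components outside this view")
        self.mappings.append(mapping)


view = ImplementationView(["track", "display"], ["cpu0", "cpu1"])
view.add_mapping({"track": "cpu0", "display": "cpu0"})   # e.g. from one allocation tool
view.add_mapping({"track": "cpu0", "display": "cpu1"})   # e.g. from another tool
```

Because the structures are shared and only the mappings differ, the candidate allocations can be compared directly during trade-off analysis.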

To better explain the data structures represented within the implementation model, graphic
figures have been provided. Figure 5 contains a legend of the graphical notations. In Figures 6,
10, and 11, there are three types of edges connecting the data structures:

1. A points to relation, indicating that a structure contains a variable which
points to another data structure. There are two types of points to relations
(edges). The first type represents decomposition through a contains or is contained
by relation, depending on the direction of the arrow. Typically, the
contains relation points downward (vertically) on the page and the contained
by relation points upward (vertically) on the page. The decomposition implies that
when a data structure contains another data structure, the former data structure
may be viewed as the parent and the latter is called a child. The second
type denotes a has relation and does not involve a notion of decomposition
into child entities.

2. A references relation, when two structures show a common data element.

3. An is linked to relation, showing that the data structure is part of a linked list.
For each of these relations there are two pointers, one for the next occurrence
in the list and one for the previous occurrence in the list. This doubly
linked list structure allows for reduced programming in traversing the list.
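The is linked to relation, with its next and previous pointers, corresponds to an ordinary doubly linked list. The Python sketch below shows why two pointers simplify traversal: the list can be walked from either end with the same few lines. The `Node` class and the helper names are illustrative and are not part of DIS.

```python
from typing import Iterator, Optional


class Node:
    """A list element carrying both a next and a previous pointer."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.next: Optional["Node"] = None
        self.previous: Optional["Node"] = None


def append(tail: Optional[Node], node: Node) -> Node:
    """Link node after tail and return the new tail."""
    if tail is not None:
        tail.next = node
        node.previous = tail
    return node


def forward(head: Node) -> Iterator[str]:
    """Visit the list from head to tail using the next pointers."""
    node: Optional[Node] = head
    while node is not None:
        yield node.name
        node = node.next


def backward(tail: Node) -> Iterator[str]:
    """Visit the list from tail to head using the previous pointers."""
    node: Optional[Node] = tail
    while node is not None:
        yield node.name
        node = node.previous


head = Node("view_1")
tail = append(head, Node("view_2"))
tail = append(tail, Node("view_3"))
```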

3.2.1. Software Structure

Figure 6 represents the software structure architecture. The data type for the software structure
is shown in Figure 7. Each software structure diagram is represented by a list of modules and a
list of edges between modules. A module represents a collection of nodes and edges within a
structure diagram (as explained above). Again, using the data flow diagram as an example, a
module could be considered as a level of decomposition.


[Legend: data structure (shading implies a list); contains relation; has relation; is linked to
relation (including next and previous); references relation]

Figure 5: Nodes and Edges Used to Describe DIS Components.


Figure 6: Software Structure Architecture.


type DIS_sw_structure_diagram is
   record
      sw_structure_diagram_name  : DIS_sw_diagram_name_type;
      sw_structure_diagram_id    : DIS_sw_diagram_id_type;
      sw_module_list             : DIS_sw_module_ptr;
      sw_module_edge_list        : DIS_sw_module_edge_ptr;
      sw_structure_diagram_next  : DIS_sw_structure_diagram_ptr;
      sw_diagram_previous        : DIS_sw_structure_diagram_ptr;
      parent_implementation_view : DIS_implementation_view_ptr;
   end record;

Figure 7: Software Structure Data Type.


Software  modules  can  be  nested  and  each  module  includes  its  own  task  graph.  Task  graphs 
cannot  be  nested  since  the  node  of  a  task  graph  cannot  be  a  module.  However,  nested  relations 
between  tasks  can  be  captured  using  nested  modules.  Our  view  is  that  the  task  represents  a 
separately  executable  computational  entity. 

A  software  module  contains  the  following  information: 

1.  The  hierarchical,  sibling  and  nesting  relations  between  modules. 

2.  The  identity  of  task  graphs  that  belong  to  the  module.  In  addition,  there  are  two  special 
kinds  of  edges  (called  entry_super_edge  and  exit_super_edge).  They  are  used  to  identify 
the  entry  and  exit  points  of  the  task  graph  at  the  module  level. 

The  data  type  for  a  software  module  is  shown  in  Figure  8. 


type DIS_sw_module_type is
   record
      module_id              : DIS_sw_module_id_type;
      module_name            : DIS_name_type;
      parent_sw_structure    : DIS_sw_structure_diagram_ptr;
      parent_module          : DIS_sw_module_ptr;
      next_module            : DIS_sw_module_ptr;
      previous_module        : DIS_sw_module_ptr;
      submodule_list         : DIS_sw_module_list_ptr;
      -- define links between super edges of the submodules.
      sw_module_edge_list    : DIS_sw_module_edge_ptr;
      -- Task graph belonging to this module.
      task_node_list         : DIS_task_node_ptr;
      task_edge_list         : DIS_task_edge_ptr;
      entry_super_edge_list  : DIS_task_edge_ptr;
      exit_super_edge_list   : DIS_task_edge_ptr;
   end record;

Figure 8: Software Module Data Type.
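
The nesting rule described above (modules may contain modules, while a task graph holds only tasks) can be illustrated with a small sketch. It is written in Python rather than Ada purely for brevity, and the class and field names are hypothetical stand-ins, not the DIS declarations themselves.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TaskNode:
    """A separately executable computational entity (illustrative only)."""
    name: str


@dataclass
class Module:
    """Hypothetical stand-in for DIS_sw_module_type; not the DIS definition."""
    name: str
    # Excluded from repr/compare so parent <-> child links do not recurse.
    parent: Optional["Module"] = field(default=None, repr=False, compare=False)
    submodules: List["Module"] = field(default_factory=list)
    task_nodes: List[TaskNode] = field(default_factory=list)

    def add_submodule(self, child: "Module") -> "Module":
        """Nest a module; tasks themselves are never nested."""
        child.parent = self
        self.submodules.append(child)
        return child

    def all_tasks(self) -> List[str]:
        """Task names of this module followed by those of nested modules."""
        names = [t.name for t in self.task_nodes]
        for sub in self.submodules:
            names.extend(sub.all_tasks())
        return names

    def depth(self) -> int:
        """Nesting depth: 0 for a top-level module."""
        return 0 if self.parent is None else 1 + self.parent.depth()


tracker = Module("tracker", task_nodes=[TaskNode("read_sensor")])
filt = tracker.add_submodule(
    Module("filter", task_nodes=[TaskNode("smooth"), TaskNode("predict")])
)
print(tracker.all_tasks(), filt.depth())
```

Nested relations between tasks are expressed here only through the module hierarchy, which mirrors the restriction that a task graph node cannot itself be a module.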

A task graph is a directed graph: each node denotes a schedulable computational entity and an
edge represents a precedence relation between two nodes. For each task node in the
DIS_task_node structure, there is a task_input_list to identify input data and a task_output_list to
identify output data generated by the task. In addition, task_predecessor_list identifies tasks that
execute before the task and task_successor_list identifies tasks that execute after the task. There
is an and_or flag associated with each of the above four task lists that specifies whether all input (or
output) data are needed (or generated) by the task. This information is required by some
optimization algorithms. Each task may include timing information such as ready time, deadline
and duration. In addition, it identifies the resources it needs: DIS_resource_type identifies the
resource a task needs and the amount it needs. For each task edge, task_edge_data identifies the
data associated with the edge along with the duration of availability of the data. In addition,
from_task_node and to_task_node specify the source and destination of the edge.

The  declarations  for  the  task  node  data  structure  and  the  task  edge  data  structure  are  shown  in 
Figure  9. 
type DIS_task_node is
   record
      task_id                           : DIS_task_id_type;
      task_name                         : DIS_name_type;
      task_structure                    : DIS_task_structure_type;
      task_description                  : DIS_task_description_type;
      -- Task data dependencies.
      task_input_and_or                 : DIS_and_or_type;
      task_input_list                   : DIS_data_type_ptr;
      task_output_and_or                : DIS_and_or_type;
      task_output_list                  : DIS_data_type_ptr;
      -- Task precedence relations.
      task_predecessor_and_or           : DIS_and_or_type;
      task_predecessor_list             : DIS_task_node_ptr;
      task_successor_and_or             : DIS_and_or_type;
      task_successor_list               : DIS_task_node_ptr;
      -- Timing information.
      task_ready_time                   : DIS_time_type;
      task_deadline                     : DIS_time_type;
      task_duration                     : DIS_time_type;
      -- Resource needs.
      task_resource_needs               : DIS_resource_ptr;
      task_max_replication              : DIS_task_count_type;
      task_buddy_task                   : DIS_task_node_ptr;
      task_priority                     : DIS_task_priority_type;
      task_execution_probability        : DIS_task_e_probability_type;
      task_communication_delay_matrix   : DIS_task_comm_matrix_ptr;
      task_error_cumulation             : DIS_task_error_type;
      task_imprecise_error_convergence  : DIS_task_error_type;
      task_next                         : DIS_task_node_ptr;
      task_previous                     : DIS_task_node_ptr;
   end record;

type DIS_task_edge is
   record
      task_edge_id        : DIS_task_edge_id_type;
      task_edge_data      : DIS_data_type_ptr;
      from_task_node      : DIS_task_node_ptr;
      to_task_node        : DIS_task_node_ptr;
      next_task_edge      : DIS_task_edge_ptr;
      previous_task_edge  : DIS_task_edge_ptr;
   end record;

Figure 9: Task Node and Edge Structure.
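
As an illustration of how the timing and precedence fields of a task node might be used together, the following Python sketch computes the earliest finish time of each task and lists the tasks that miss their deadlines. It is a minimal sketch under stated assumptions: the field names are simplified, resource contention is ignored, and the reading of the and_or flag (AND waits for every predecessor, OR for any one) is an interpretation for this example, not something the DIS definition states.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Task:
    """Simplified, hypothetical subset of the DIS_task_node fields."""
    name: str
    duration: float
    ready_time: float = 0.0
    deadline: float = float("inf")
    predecessors: List[str] = field(default_factory=list)
    pred_and: bool = True  # assumed meaning: True = all predecessors, False = any


def earliest_finish(tasks: Dict[str, Task]) -> Dict[str, float]:
    """Earliest finish time per task, ignoring resource contention."""
    finish: Dict[str, float] = {}
    visiting = set()

    def visit(name: str) -> float:
        if name in finish:
            return finish[name]
        if name in visiting:
            raise ValueError(f"cycle through task {name!r}")
        visiting.add(name)
        task = tasks[name]
        pred_finishes = [visit(p) for p in task.predecessors]
        if not pred_finishes:
            released = 0.0
        elif task.pred_and:
            released = max(pred_finishes)  # wait for every predecessor
        else:
            released = min(pred_finishes)  # any one predecessor suffices
        finish[name] = max(task.ready_time, released) + task.duration
        visiting.discard(name)
        return finish[name]

    for name in tasks:
        visit(name)
    return finish


def missed_deadlines(tasks: Dict[str, Task], finish: Dict[str, float]) -> List[str]:
    """Names of tasks whose earliest finish is already past the deadline."""
    return sorted(n for n, t in tasks.items() if finish[n] > t.deadline)


graph = {
    "sense": Task("sense", duration=2.0),
    "track": Task("track", duration=3.0, predecessors=["sense"]),
    "log": Task("log", duration=1.0, ready_time=4.0),
    "fuse": Task("fuse", duration=1.0, deadline=5.5, predecessors=["track", "log"]),
}
finish_times = earliest_finish(graph)
print(finish_times, missed_deadlines(graph, finish_times))
```

In this example the fuse task cannot finish before time 6.0, so it is reported as missing its deadline of 5.5 even before any resource constraints are considered.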
3.2.2. Hardware Structure

A hardware structure diagram defines a hardware configuration. A hardware configuration is
viewed as consisting of hardware nodes, connected by hardware links. Each node is recursively
viewed as consisting of internal nodes that are connected by internal links. The architecture of
the hardware structure diagram is shown in Figure 10.


Figure 10: Hardware Structure Architecture
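
The recursive view of a hardware node (a node made of internal nodes joined by internal links) can be sketched as follows. This is an illustrative Python model only; the class and field names are assumptions for the example and do not come from the DIS hardware declarations.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class HwNode:
    """Hypothetical recursive hardware node (not the DIS declaration)."""
    name: str
    internal_nodes: List["HwNode"] = field(default_factory=list)
    internal_links: List[Tuple[str, str]] = field(default_factory=list)

    def leaves(self) -> List[str]:
        """Names of the nodes that have no internal structure."""
        if not self.internal_nodes:
            return [self.name]
        found: List[str] = []
        for child in self.internal_nodes:
            found.extend(child.leaves())
        return found

    def link_count(self) -> int:
        """Number of links at this level plus all nested levels."""
        return len(self.internal_links) + sum(
            child.link_count() for child in self.internal_nodes
        )


board_a = HwNode("board_a", [HwNode("cpu"), HwNode("mem")], [("cpu", "mem")])
cabinet = HwNode("cabinet", [board_a, HwNode("board_b")], [("board_a", "board_b")])
print(cabinet.leaves(), cabinet.link_count())
```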
3.2.3. Mapping Structure

A mapping assignment consists of mapping constraints and task assignments. Figure 11
represents the mapping structure architecture. There are two types of mapping constraints:
timing constraints and placement constraints. Each mapping constraint includes a preference
value that specifies the importance of meeting the mapping constraint; the magnitude of the
value reflects its importance.

The data structure for a mapping constraint is shown in Figure 12. There are four kinds of timing
constraints; each timing constraint is defined on a set of tasks:

1. complete_within t means that the set of tasks should complete within t time
units of each other.

2. start_within t means that the set of tasks should start within t time units of
each other.

3. complete_path_within t means that the sequence of the set of tasks should
complete within t time units from the beginning of the sequence.

4. complete_start_within t for two tasks, A and B, means that B should start within
t after the completion of A.

There are three kinds of placement constraints; each placement constraint is defined on a set of
tasks:

1. place_together means that the set of tasks should be assigned to the same
hardware component.

2. place_separate means that the set of tasks should be assigned to different
hardware components.

3. place_at means that the set of tasks should be assigned to a particular hardware
component.
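
The seven constraint kinds listed above can be checked against a candidate task assignment with a short sketch such as the one below. It is written in Python for brevity and rests on assumptions that the text does not fix: the `Placed` and `Constraint` records are hypothetical, complete_path_within is read as "last finish minus first start", and the penalty (the sum of the preference values of the violated constraints) is only one plausible way to use the preference value.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Placed:
    """Hypothetical result of assigning one task: times and host component."""
    start: float
    finish: float
    hw: str


@dataclass
class Constraint:
    """Hypothetical flattened form of a timing or placement constraint."""
    kind: str
    tasks: List[str]
    preference: int = 1
    t: float = 0.0            # used by timing constraints only
    hw: Optional[str] = None  # used by place_at only


def satisfied(c: Constraint, plan: Dict[str, Placed]) -> bool:
    items = [plan[name] for name in c.tasks]
    starts = [p.start for p in items]
    finishes = [p.finish for p in items]
    hosts = [p.hw for p in items]
    if c.kind == "complete_within":
        return max(finishes) - min(finishes) <= c.t
    if c.kind == "start_within":
        return max(starts) - min(starts) <= c.t
    if c.kind == "complete_path_within":
        return max(finishes) - min(starts) <= c.t  # assumed reading
    if c.kind == "complete_start_within":
        first, second = items  # exactly two tasks: A, then B
        return 0 <= second.start - first.finish <= c.t
    if c.kind == "place_together":
        return len(set(hosts)) == 1
    if c.kind == "place_separate":
        return len(set(hosts)) == len(hosts)
    if c.kind == "place_at":
        return all(h == c.hw for h in hosts)
    raise ValueError(f"unknown constraint kind: {c.kind}")


def penalty(constraints: List[Constraint], plan: Dict[str, Placed]) -> int:
    """Assumed scoring rule: sum of preference values of violated constraints."""
    return sum(c.preference for c in constraints if not satisfied(c, plan))


plan = {
    "A": Placed(0.0, 2.0, "cpu1"),
    "B": Placed(3.0, 5.0, "cpu1"),
    "C": Placed(0.5, 4.0, "cpu2"),
}
constraints = [
    Constraint("complete_start_within", ["A", "B"], preference=5, t=1.0),
    Constraint("start_within", ["A", "C"], preference=2, t=0.25),
    Constraint("place_together", ["A", "B"], preference=3),
    Constraint("place_separate", ["A", "B", "C"], preference=4),
]
print(penalty(constraints, plan))
```

Here the start_within and place_separate constraints are violated, so an allocator comparing candidate assignments under this assumed rule would see a penalty of 6 for this plan.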
[Figure legend: DIS_sw_id_list_type and DIS_hw_id_list_type nodes; Contains and References edges.]

Figure 11: Mapping Structure Architecture.


type DIS_mapping_constraint is
   record
      timing_constraint     : DIS_t_constraint;
      placement_constraint  : DIS_p_constraint;
      parent_mapping_view   : DIS_mapping_view;
   end record;

type DIS_t_constraint_kind_type is (complete_within, start_within,
   complete_path_within, complete_start_within);

type DIS_t_constraint_type is
   record
      t_constraint_kind          : DIS_t_constraint_kind_type;
      preference_value           : DIS_preference_range_type;
      time_value                 : DIS_time_type;
      software_id_list           : DIS_softw_id_list;
      parent_mapping_constraint  : DIS_mapping_constraint_type;
      next_t_constraint          : DIS_t_constraint_type;
      previous_t_constraint      : DIS_t_constraint_type;
   end record;

type DIS_p_constraint_kind_type is (place_together, place_separate,
   place_at);

type DIS_p_constraint_type is
   record
      p_constraint_kind          : DIS_p_constraint_kind_type;
      preference_value           : DIS_preference_range_type;
      hardw_id                   : DIS_hardw_id_type;
      softw_id_list              : DIS_softw_id_list;
      parent_mapping_constraint  : DIS_mapping_constraint_type;
      next_p_constraint          : DIS_p_constraint_type;
      previous_p_constraint      : DIS_p_constraint_type;
   end record;

Figure 12: Mapping Constraints Data Declarations

A task assignment is the result of running an allocation algorithm on a set of software and
hardware nodes with a set of timing and placement constraints.


4.  Future  Directions  and  Conclusions 

The development of DIS is in the first year of an ongoing effort. Several tasks are
planned for future development. These are elaborated below.

1.  Library  of  services 

In addition to setting up standards for data structures, DIS should also provide operations
such as load/unload, add, delete, update and queries.

2. Importing results to design capture

Provide  feedback  functions  to  incorporate  the  results  of  the  optimization  back  into  the 
design  without  modifying  the  structural  components. 

3.  Exporting  results  from  Simulation/Optimization  systems 

Interface specification components can represent results from simulation/optimization
systems. This should not only enhance the process, but also provide a standard data
representation for developing extract functions for the target systems.

4.  Alternative  representation  languages 

Use of high-level object-oriented languages to represent DIS has the advantage of
inheritance, allowing a higher degree of reusability.

Additionally, now that several versions of DIS have been produced, and as it becomes more
robust, participation in various standards organizations will begin. Involvement in the CASE
Document Interchange Format Working Group is expected to begin by the third quarter of 1992.
Abstract

This  paper  will  address  issues  related  to  the  integration  of  the  systems  evaluation  methodologies  including 
background,  problems,  and  possible  solutions.  As  systems  and  their  development  become  more  complex,  it 
becomes  critical  that  evaluations  and  assessments  of  the  system’s  behavior  become  an  integral  part  of  the  system 
development  life  cycle.  An  integration  of  the  evaluation  methods  into  the  design  capture  will  allow  more  of  the 
assessments  to  become  a  part  of  the  design  decisions  of  the  system. 


Introduction: 

Current  state-of-the-practice  of  the  modeling  and  analysis  methodologies  provides  fragmented  use  of  the 
capabilities  that  are  available  to  the  development  cycle.  Modeling  and  evaluation  capabilities  are  not  as  closely 
tied  to  the  core  of  the  design  cycle  as  needed  to  provide  a  seamless  path  between  design  synthesis  and  system 
evaluation.  A  designer  who  wants  to  test  a  certain  part  of  the  system  (or  some  aspect  of  the  whole  system) 
resorts  to  the  evaluation  capabilities  to  make  an  assessment  of  the  system.  To  date,  however,  there  are  no 
formal  techniques  to  apply  the  information  that  is  obtained  from  the  assessment  to  the  design  specification 
[CHO91]. There needs to be a better information flow between design capture and performance evaluation
models.  Although  the  narrow  definition  of  performance  does  not  include  reliability,  availability,  and  security, 
this  paper  addresses  these  issues. 

Currently,  there  is  a  large  gap  between  the  representations  of  design  capture  and  performance  evaluation 
models,  although  there  has  been  much  research  that  has  tried  to  bridge  this  gap  (such  as  Teamwork  and  ADAS, 
STP  and  SES  Workbench).  There  are  standards  efforts  like  CDIF  that  are  addressing  the  interchangeability 
among  CASE  tools  that  may  serve  as  an  intermediate  format  between  CASE  representations  and  performance 
evaluation  models.  These  efforts  are  immature  and  do  not  address  semantic  issues  for  the  most  part. 

One  of  the  many  problems  with  not  having  integrated  capabilities  in  complex  system  development  stems 
from  the  fact  that  typically  several  contractors  work  on  large  projects.  Not  all  of  these  contractors  will  use 
compatible  tools  or  methods.  This  type  of  a  problem  usually  results  in  inconsistent  design  specifications  and 
incoherent  system  designs. 

Within the performance evaluation environment, the size of the system under development (SUD)
quickly becomes a limiting factor. The number of states in the state-dependent representation will typically
increase exponentially with the number of nodes. There have been many proposals that have tried to
address this issue, ranging from a hierarchical representation of behavior to a partitioning of the SUD.
Another possible solution may be to apply a sensitivity analysis early in the development cycle to identify system
parameters that greatly affect the performance with even the slightest change. In effect, a sensitivity analysis
identifies critical areas of the system that deserve a more detailed evaluation.
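
A one-at-a-time sensitivity analysis of the kind suggested here can be sketched in a few lines. The example below is only an illustration: it uses the textbook M/M/1 mean response time, 1 / (service rate - arrival rate), as a stand-in performance model, which is an assumption made for the example and not a model taken from this paper, and the function and parameter names are hypothetical.

```python
from typing import Callable, Dict


def response_time(p: Dict[str, float]) -> float:
    """Mean response time of an M/M/1 queue; valid only when arrival < service."""
    return 1.0 / (p["service_rate"] - p["arrival_rate"])


def elasticities(
    model: Callable[[Dict[str, float]], float],
    params: Dict[str, float],
    rel_step: float = 1e-4,
) -> Dict[str, float]:
    """Relative sensitivity (percent change in output per percent change in input)."""
    base = model(params)
    result = {}
    for name, value in params.items():
        h = value * rel_step
        up = dict(params, **{name: value + h})
        down = dict(params, **{name: value - h})
        derivative = (model(up) - model(down)) / (2 * h)  # central difference
        result[name] = derivative * value / base
    return result


params = {"arrival_rate": 9.0, "service_rate": 10.0}
sens = elasticities(response_time, params)
# Parameters ordered from most to least influential on the response time.
ranked = sorted(sens, key=lambda k: abs(sens[k]), reverse=True)
print(ranked, sens)
```

At 90 percent utilization both elasticities are close to ten in magnitude, so a one percent change in either rate shifts the response time by roughly ten percent; parameters that rank this high are the ones that deserve the more detailed evaluation.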




Large  complex  computer-based  systems  are  typically  characterized  as  discrete  event  dynamic  systems 
(DEDS).  Since  there  is  no  single  paradigm  for  DEDS,  each  of  the  modeling  techniques  offers  advantages  to 
certain  types  of  analyses,  but  not  to  all.  Just  as  a  design  specification  needs  multiple  views  to  provide  a 
comprehensive  description  of  the  system,  a  performance  evaluation  needs  to  use  multiple  models  to  provide  a 
comprehensive  assessment  of  the  system.  For  example,  finite  state  machines  may  be  suitable  for  some 
communication  protocol  specification,  but  are  not  suitable  when  unbounded  queues  are  present.  Some  claim 
that a Petri net can be used to represent any system. While this may be true, one can also use other modeling
approaches to represent any system, although the representation may be so complex that it is incomprehensible
to everyone except its designer. The suitability of a modeling technique for performing certain assessments
is the salient requirement, making maximum use of the inherent characteristics of the modeling technique. The
ease of use (of representation and analysis) and available analysis capabilities may make it more beneficial to
use one modeling approach versus another. Therefore, it is apparent that to make optimum use of the
evaluation capabilities, one needs to employ more than one model to represent and evaluate the SUD [GOE91].

The  following  sections  discuss  some  of  the  ways  that  current  research  efforts  have  tried  to  address  the 
problems  and  issues  discussed  above.  The  first  section  discusses  the  use  of  multiple  models  and  their  integration 
into  the  design  process.  The  second  section  discusses  the  integration  of  performance  evaluation  models  and 
design  capture  representations.  The  third  section  discusses  the  long-term  goal  of  unifying  the  design  capture, 
performance  models,  and  other  specifications  (such  as  requirements,  implementations,  etc.). 

Transformation  Among  Models: 

DEDS  do  not  have  a  single  paradigm  to  represent  the  system.  Each  model  that  has  been  developed  has 
certain  advantages  and  disadvantages.  A  systems  engineer,  however,  needs  information  about  the  behavior  of 
the  system  that  typically  requires  more  than  a  single  model. 

One near-term solution to the current limitations of a single model may be to use multiple models as
multiple "views" of the system, just as design capture uses multiple views to represent the design of the system.
Then, as is the case in design capture, traceability and consistency become salient and critical features.

Each "view" has a set of evaluation criteria or metrics that it determines when evaluation methods are
applied to the representation. For example, semi-Markov reward models can be used to perform reliability and
availability evaluation of the system [TRI91], or a stochastic Petri net to evaluate the dynamics, deadlock, livelock,
and reachability of the system. Each "view" needs to capture just enough information about the system to
perform its assessments; hence, some of the overhead of carrying unnecessary information is reduced. Within
this framework, each "view" has to be consistent with other "views." A value of a parameter in one "view" must
be the same in the other "view," and an architecture in one model must be the same as that of the others. To
provide consistency, information in one model must be traceable to that in the other. This is a
difficult problem because the appearance of the information may differ greatly among models (a single parameter
in one model may be split and distributed across the other models).
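
The consistency requirement between views, including the case where one parameter is split across several parameters in another model, can be made concrete with a small sketch. The Python below is illustrative only: the view names, parameter names, and the rule that split parameters must sum to the original value are all assumptions chosen for the example.

```python
import math
from typing import Dict, List

Views = Dict[str, Dict[str, float]]
# Trace links: quantity -> view name -> parameters in that view whose sum
# represents the quantity (the additive rule is an assumption of this sketch).
TraceLinks = Dict[str, Dict[str, List[str]]]


def inconsistent_quantities(
    views: Views, links: TraceLinks, rel_tol: float = 1e-9
) -> List[str]:
    """Quantities whose traced values disagree between at least two views."""
    bad = []
    for quantity, per_view in links.items():
        totals = [
            sum(views[view][param] for param in params)
            for view, params in per_view.items()
        ]
        if any(
            not math.isclose(totals[0], other, rel_tol=rel_tol)
            for other in totals[1:]
        ):
            bad.append(quantity)
    return bad


views = {
    "reliability": {"cpu_mttf_hours": 5000.0, "msg_delay_ms": 12.0},
    "petri_net": {
        "cpu_mttf_hours": 5000.0,
        "queue_delay_ms": 7.0,
        "wire_delay_ms": 4.0,
    },
}
links = {
    "cpu_mttf": {
        "reliability": ["cpu_mttf_hours"],
        "petri_net": ["cpu_mttf_hours"],
    },
    "msg_delay": {
        "reliability": ["msg_delay_ms"],
        "petri_net": ["queue_delay_ms", "wire_delay_ms"],
    },
}
print(inconsistent_quantities(views, links))
```

The message delay is a single parameter in one view and two parameters in the other; the explicit trace link is what allows the mismatch (12 ms versus 11 ms) to be detected at all.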

The first problem within the transformation among models is that, even within performance models, there
are differences in the level of information detail. Some models are used to evaluate the steady-state behavior
of a system and queuing characteristics, while others simulate the system performing functions. Some of the
differences in the level of information are due to the definition of the system by the user and others are due to
the limitations of the modeling scheme itself.

The second problem with the transformation among the models is that it is typically not a one-to-one
mapping. Therefore, when one model is transformed to another model and is transformed back to the original
modeling technique without modifications, the resulting model may look totally different from the original
model although it is mathematically equivalent. There are also semantic differences among the performance
models that make the transformations difficult.

A long-term solution to maximizing the use of the performance evaluation models may be to develop an
intermediate form of representation of the performance models (which is similar in theory to the intermediate
form of CASE representation). This intermediate form may just be a way to collect enough data for the
transformation into many performance models, or it may be an extension of any one of the modeling techniques.
One danger of having a single representation is that the evaluative and expressive powers of a modeling
technique may be lost at the expense of making extensions to the existing model. Another danger is that an
information explosion may occur from trying to add too much to this intermediate form.

Transformation  Between  Capture  Methods  and  Performance  Evaluation  Models: 

A transformation technique between a design capture representation and a performance evaluation model
attempts to use the performance model as a design specification as well as a guide in the design. Many design
decisions are made in the evaluation and assessment stages of the system development life cycle; therefore, there
needs to be a closer linkage between design capture and performance model representations. A bi-directional
transformation between design capture and performance model would provide a means to validate the evaluation
results.


Our work in this area attempts to bridge the gap among the currently developing multi-paradigm views
of the system [HOA91], including a resource library, a resource allocation and optimization tool [NGU91], and
an intermediate form of a performance model. Problems that occur due to differences in the representation
include semantic inconsistencies, lack of information, and ambiguities.

Like  most  new  tools  and  methodologies,  design  capture  had  a  basic  structure,  and  additions  were  made 
to  those  structures  to  answer  other  questions  or  to  address  inadequacies  (such  as  real-time  extension).  The 
objective  of  design  specification  tools,  however,  differs  from  that  of  evaluation  tools  and  therefore  information 
that  is  contained  in  the  design  capture  differs  from  that  of  performance  evaluation  models.  As  such,  there  is 
some  information  that  is  not  provided  in  the  current  design  capture,  such  as  a  resource  model.  This  missing 
information  must  be  added  to  transform  the  design  capture  to  performance  evaluation  models. 

The  second  problem  is  also  caused  by  the  different  objectives  of  these  two  representations.  For  example, 
in  the  data  flow  diagram,  the  functional  decomposition  in  the  design  capture  stops  when  a  systems  engineer  can 
map  a  function  to  a  resource  (whether  it  be  hardware,  software,  humanware,  or  a  combination).  However,  even 
at  the  lowest  level  of  the  functional  decompositions,  there  may  be  a  lot  of  ambiguities  due  to  multiple  inputs  and 
outputs  of  a  single  process  bubble  and  a  sequence  of  interactions  of  the  processes.  These  ambiguities  must  be 
removed  for  many  modeling  methods. 

The  flow  of  information  from  the  design  capture  to  a  performance  model  is  difficult.  But  having  the 
information  flow  in  the  other  direction,  from  performance  evaluation  model  to  design  capture,  is  more  difficult. 
This  task,  however,  may  be  more  crucial  for  the  widespread  use  of  the  systems  engineering  tools.  Once  an 
evaluation  is  performed,  information  obtained  from  the  evaluation  needs  to  be  fed  back  into  the  design  capture. 
This  information  is  pertinent  to  the  decisions  made  at  the  design  capture,  yet  there  is  a  lack  of  formal  methods 
to  feed  back  this  information.  There  may  also  be  design  decisions  made  on  the  system  during  the  evaluation 
phase,  on  the  performance  model.  These  decisions  also  have  to  be  reflected  back  on  the  design  specifications. 
This  transformation  also  validates  that  the  design  matches  the  evaluation. 

An example of the problem that may arise in a direct transformation from performance evaluation models
to design capture is the separation of hardware, software, and functionality. When the design capture is transformed
to performance models, some of the functions are performed by the hardware, some are mapped to software
that gets mapped to the hardware, and some are mapped to humanware. Differentiating the changes in the
performance model (into what actually is a function and what is just a resource) may not be possible automatically.

There are three goals which address this issue of having a bi-directional flow of information: short-term,
mid-term, and long-term. A short-term goal is to attach a note of the changes and pass it to a CASE
representation. The decisions on making changes then fall on the design capture side. A mid-term goal is to
semi-automate the transformation. Someone working with a tool like DESTINATION [NGU91], which is used
to perform resource allocations and optimization, has access to both design capture representations and
performance evaluation models. Once the changes are made to the performance models, DESTINATION
resource candidate allocation may be used to map the changes. These changes may be as simple as the
reallocation of existing resources. In this case, changes will be identified in the resource candidate allocation,
and they may be made in DESTINATION in a semi-automated fashion. The long-term goal is, of course,
to automate the process.
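
The short-term goal, a note of changes handed back to the design capture side, might look like the following sketch when the changes are simple reallocations. This is a hypothetical Python illustration: the allocation dictionaries and the note format are assumptions, and nothing here reflects the actual DESTINATION tool or any CASE tool interface.

```python
from typing import Dict, List

Allocation = Dict[str, str]  # function name -> resource name (illustrative)


def change_note(before: Allocation, after: Allocation) -> List[str]:
    """Human-readable list of allocation changes made on the performance model."""
    notes = []
    for func in sorted(set(before) | set(after)):
        old, new = before.get(func), after.get(func)
        if old == new:
            continue
        if old is None:
            notes.append(f"ADD {func} on {new}")
        elif new is None:
            notes.append(f"REMOVE {func} from {old}")
        else:
            notes.append(f"MOVE {func}: {old} -> {new}")
    return notes


before = {"track": "cpu1", "display": "cpu1", "log": "disk"}
after = {"track": "cpu2", "display": "cpu1", "archive": "disk"}
note = change_note(before, after)
print("\n".join(note))
```

The note leaves the decision to the designer: it records what changed during evaluation without modifying the design capture itself, which is what separates the short-term goal from the semi-automated and automated ones.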

Global Representation?:

A  global  representation  is  not  something  that  can  be  obtained  today  or  tomorrow  as  much  as  it  is  a  goal 
(or  a  focal  point)  to  which  the  industry  is  moving.  Whether  this  goal  is  achievable  is  questionable  [ZAV91]. 
At  this  moment,  there  is  a  lack  of  a  seamless  process  from  requirements  specification  to  design,  evaluation,  and 
the  implementation  phases  of  the  system  development  cycle.  A  global  representation  could  serve  as  a 
requirements  specification  and,  if  it  is  robust,  could  serve  as  a  design  specification. 

At this time, there is research on forming a standard CASE data interchange format and some research
on developing pseudo-standard performance models. Convergence of these representations may yield the unifying
representations that are harbingers of the global representation. However, these representations capture only
a small amount of the information needed.

The  first  of  many  problems  with  forming  a  global  representation  is  balancing  the  amount  of  information. 
The  objective  would  be  to  have  just  enough  information  to  specify  the  system,  be  able  to  perform  analysis,  and 
use  it  to  implement  the  system;  otherwise,  redundancies  and  information  overload  would  arise. 

The second problem is maintaining traceability of the information. The global representation of a system
contains more information than needed to perform certain functions (such as timing analysis); therefore, a
transformation would be used to perform these functions. After these functions are performed, any changes
made must be reflected back to the global representation. Although the changes to the global representation
would be guaranteed by the correctness of the transformation, there needs to be a way to trace the changes in
the global representation to ensure that other aspects of the system are not affected by those changes.
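
One way to picture the traceability requirement is as a reachability question over dependency links: given the elements that changed, which other elements might be affected? The Python sketch below is a minimal illustration under that assumption; the element names and the `dependents` map are hypothetical and do not describe any existing representation.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def impacted(dependents: Dict[str, List[str]], changed: Iterable[str]) -> Set[str]:
    """Elements reachable from the changed ones through dependency links."""
    start = set(changed)
    seen: Set[str] = set()
    queue = deque(start)
    while queue:
        item = queue.popleft()
        for nxt in dependents.get(item, []):
            if nxt not in seen and nxt not in start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# element -> elements that depend on it (illustrative trace links)
dependents = {
    "task_deadline": ["schedule"],
    "schedule": ["cpu_allocation", "timing_report"],
    "cpu_allocation": ["power_budget"],
    "display_format": ["operator_manual"],
}
affected = impacted(dependents, ["task_deadline"])
print(sorted(affected))
```

Elements outside the reachable set, such as the display format in this example, can be shown to be untouched by the change, which is the assurance the paragraph above asks for.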

Conclusions: 

The  issues  that  are  addressed  in  this  paper  are  not  by  any  means  complete.  They  do  represent  some  of 
the  major  issues  that  have  been  encountered  in  this  on-going  research  effort. 
1. INTRODUCTION

Top-level system requirements are analogous to software requirements in many respects. The
purpose of each is to describe "what" is required for a system, subject to constraints, without
giving details of "how" this should be accomplished. Historically the methods and tools for software
requirements specification have been adopted as also suitable to systems requirements [Dorfman
1990].

For a small, single computer-based system there may be little distinction between the system
and software requirements. However, for large complex systems the software comprises just some
of the components of what may be a distributed system with many hardware, software, and
human-oriented subsystems. Ideally the system requirements should be understandable both by the
customers (as C-requirements) and by the system developers (as D-requirements) [Rombach 1990].
Developers require a precise statement of requirements that can be verified, i.e., there must be
some cost-effective procedure for determining whether an implementation satisfies the
requirements. On the other hand, such a precise statement may require concepts and notations that are
unfamiliar to the customers. More realistically, one must be able at least to validate the
requirements in some manner with respect to the customer's needs.

The SCR (Software Cost Reduction) methodology attempts to provide a basis for customer
as well as developer understanding of software requirements by the use of semi-formal
representations and a well-defined set of principles. The application of these principles is demonstrated in
a complete example of the software requirements of an actual Navy system: the operational flight
program of the A-7E aircraft [Heninger 1980, Alspaugh et al. 1992]. The formal representations
are largely based upon finite state machine models for representing the system behavior. The SCR
methodology emphasizes an external "black-box" view of the system without any premature,
design-level partitioning of the system.

In light of these objectives of the SCR methodology, we propose extensions to that
methodology to handle system requirements rather than just software requirements. These extensions are
based in part on [van Schouwen 1990] and [Rose et al. 1991] with additional insight from [Clements
et al. 1992] and [Hester et al. 1981].

We first introduce some general principles that guide our development of a systems
requirements document. Next, we outline how these principles can be used to organize top-level systems
requirements following the pattern of the SCR methodology. Details of much of this organization
are yet to be determined; much of our focus in the discussion below is on the issues to be resolved.
2. GENERAL PRINCIPLES


•  The specification should anticipate likely system changes (e.g., changing requirements, initial
subsets to be expanded later, changing hardware characteristics, etc.). To facilitate this, the
emphasis should be on specifying a family of related systems rather than a single system.
Likely changes should be clearly documented so that the succeeding system design can be
tailored to make those changes easy.

•  The requirements specification should be organized to separate concerns. For example, likely
system changes should be separated from the rest of the specification. Many forms of
abstraction also promote separation of concerns as well as ease of change, e.g., symbolic
representation of data values to separate concerns of representation and usage within the system.

•  Development  of  the  specification  should  focus  on  the  questions  that  need  answering  before 
formulating  the  answers.  Rather  than  prematurely  giving  an  answer,  it  is  appropriate  to 
record  each  unresolved  issue  with  some  convention  such  as  “TBD.” 

•  To control redundancy, each entity should be defined in a single place (whether this be in the
specification per se, or at a higher level as part of a specification generator). As necessary
there should be automated control (e.g., macros) for requisite multiple copies of such entities
to preserve consistency. Examples from the SCR methodology are data types, system
generation parameters, and terms.

•  The requirements specification should be a reference document; i.e., the emphasis should be
on finding specific information rather than giving a general overview of the system. A general
overview may be included in this document or furnished separately. To facilitate use as a
reference document, various indices and cross references (or their automated equivalent in,
e.g., hypertext) must be included.

•  Care  should  be  taken  that  only  requirements  are  included  in  the  requirements  document. 
Premature  design  decisions  must  be  avoided.  Required  design  constraints  should  be  clearly 
marked. 

• The requirements should be stated as formally as possible since formal notations are more
likely to be consistent, unambiguous, and concise.


3.  TOWARDS  A  REQUIREMENTS  STRUCTURE 

We  consider  three  broad  areas  of  system  requirements  [STARTS  1987]: 

•  Functional  requirements  describe  “what”  behavior  is  required  of  the  system. 

•  Nonfunctional  requirements  are  attributes  of  the  system  not  covered  by  the  functional 
requirements. 

•  System  development  requirements  address  the  process  by  which  the  system  is  developed  and 
evolves  over  its  lifetime. 

The functionality of the system should be the focus in organizing system requirements
[STARTS 1987]. Nonfunctional requirements and system development requirements should not be
treated as second-class citizens but should complement the major structure provided by the functional
requirements. The mechanism for implementing this structure is left open. Requirements
specifications stored within a database allow for flexibility in that many different groupings of data
are available. For exposition we shall consider a division into chapters and sections with the
understanding that each chapter or section represents a logical view of the overall requirements
database.


Chapter 1: INTRODUCTION

The introduction includes such information as organizing principles, notations, a brief overview
of the system, and synopses of the remaining chapters.


Chapter 2: FUNCTIONAL REQUIREMENTS

We partition the system under development into the environment within which the system
will function, and the system itself. The Environmental State Variables represent the interface to
the  environment.  The  System  Modes  provide  the  control  model  for  the  behavior  of  the  system  as 
a  simplification  of  the  overall  state  of  the  environment.  Finally  the  System  Functions  describe  the 
actual  behaviors  of  the  system  as  related  to  Environmental  State  Variables  and  the  System  Modes. 


Section  2.1:  Environmental  State  Variables 

Environmental (state) variables [van Schouwen 1990, Parnas and Madey 1991] are defined in
this chapter. These time-varying variables model the external environment of the system. They
include physical quantities (e.g., temperature and pressure), readouts of displays, and even human
user characteristics (e.g., typing speed of operators external to a system). Each variable is a
monitored variable, which is measured as input to the system; a controlled variable, which is a
quantity that must be controlled by the system; or both monitored and controlled.
The environmental variables express the interface to the system as well as additional
relevant factors in the environment. It is important to include environmental restrictions (e.g.,
physical laws) in order that the environmental model sufficiently reflects reality.
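
The distinction between monitored and controlled variables, together with environmental restrictions, can be illustrated with a minimal Python sketch. This is not part of the SCR methodology or of any cited notation; the class names (EnvVar, Environment), the variable names, and the particular restrictions are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A snapshot of the environment: one value per environmental variable.
State = Dict[str, float]
# An environmental restriction is a predicate that a possible state must satisfy.
Restriction = Callable[[State], bool]


@dataclass(frozen=True)
class EnvVar:
    """Hypothetical record for one environmental variable."""
    name: str
    unit: str
    monitored: bool = False   # measured as input to the system
    controlled: bool = False  # must be controlled by the system

    def __post_init__(self) -> None:
        # Every variable is monitored, controlled, or both.
        if not (self.monitored or self.controlled):
            raise ValueError(f"{self.name}: must be monitored, controlled, or both")


@dataclass
class Environment:
    variables: List[EnvVar]
    restrictions: List[Restriction] = field(default_factory=list)

    def is_possible(self, state: State) -> bool:
        """Return True if the state satisfies every recorded restriction."""
        expected = {v.name for v in self.variables}
        if set(state) != expected:
            raise KeyError("state must assign a value to every environmental variable")
        return all(check(state) for check in self.restrictions)


# Illustrative environment (all names and limits are assumptions).
env = Environment(
    variables=[
        EnvVar("temperature_K", "K", monitored=True),
        EnvVar("valve_opening", "fraction", controlled=True),
        EnvVar("pressure_kPa", "kPa", monitored=True, controlled=True),
    ],
    restrictions=[
        lambda s: s["temperature_K"] >= 0.0,         # physical law
        lambda s: 0.0 <= s["valve_opening"] <= 1.0,  # mechanical limit
        lambda s: s["pressure_kPa"] >= 0.0,
    ],
)
```

Calling env.is_possible on a candidate state rules out states that violate a restriction, which is the role the text assigns to environmental restrictions.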

ISSUES:

• How extensive must the environmental model (captured by the environmental variables and
restrictions) be? It must capture those aspects relevant to the system being specified; the restrictions
should rule out system states that are impossible. It is perhaps better to overspecify
the environment than to underspecify it.

•  What  is  the  best  way  to  structure  the  environmental  variables?  For  which  types  of  systems 
may  object-oriented  techniques  be  useful  in  structuring  the  environmental  variables? 


Section 2.2: System Modes

A mode class represents an equivalence class of the overall system state, i.e., the individual
modes of a mode class partition the system states: each system state belongs to exactly one mode.
Modes are useful in simplifying complex environmental conditions and in capturing useful history
of the system evolution. Transitions between modes correspond to the event of one or more
environmental variables having changed (or may be defined via conditions of the environment
rather than events). Several different mode classes may be useful in simplifying the expression of
system requirements. Modes should represent only externally visible aspects of the system state.

ISSUES:


• How does one ensure that only externally visible behavior is captured via modes? [...]
this is that the modes of a mode class only capture information [...]
even if those environmental variables are not explicitly specified.

• What is the best way to represent a mode class and its transitions? Both tabular
(e.g., SCR tables [Alspaugh et al. 1992], [van Schouwen 1990]) and graphical (e.g.,
Statecharts [Harel 1987] and Modechart [Jahanian et al. 1988]) should be considered, each having
its advantages and disadvantages for ease of change, understandability, etc.
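
As a rough illustration of a tabular representation, the following Python sketch stores a mode class as a table keyed by (current mode, event). It is not the SCR table format, Statecharts, or Modechart; the class name ModeClass, the mode names, and the event names are hypothetical. Events here stand in for changes to environmental variables.

```python
from typing import Dict, FrozenSet, Tuple


class ModeClass:
    """Hypothetical mode class: the system is in exactly one mode at a time."""

    def __init__(self, name: str, modes: FrozenSet[str], initial: str,
                 table: Dict[Tuple[str, str], str]) -> None:
        if initial not in modes:
            raise ValueError("initial mode must belong to the mode class")
        for (source, _event), target in table.items():
            if source not in modes or target not in modes:
                raise ValueError("transition table refers to an unknown mode")
        self.name = name
        self.modes = modes
        self.current = initial
        # Keying on (mode, event) makes transitions deterministic by construction.
        self.table = dict(table)

    def on_event(self, event: str) -> str:
        """Apply an event; events without a table entry leave the mode unchanged."""
        self.current = self.table.get((self.current, event), self.current)
        return self.current


def make_operation_modes() -> ModeClass:
    """Build an illustrative mode class (all names are assumptions)."""
    return ModeClass(
        name="Operation",
        modes=frozenset({"Standby", "Normal", "Degraded"}),
        initial="Standby",
        table={
            ("Standby", "power_on"): "Normal",
            ("Normal", "sensor_fault"): "Degraded",
            ("Degraded", "sensor_restored"): "Normal",
            ("Normal", "power_off"): "Standby",
            ("Degraded", "power_off"): "Standby",
        },
    )
```

A table of this kind is easy to change one row at a time, whereas a graphical notation may be easier to read as a whole; the sketch takes no position on which is preferable.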


Section  2.3:  Descriptions  of  System  Functions 

All system functions should be described in this chapter. Each system function should be a
description of the external behavior of some aspect of the system in terms of the state of the
environment. In describing each system function, one must include the relevant modes during
which that function is applicable. We include both functions during normal operation and error
situations, since it may be difficult to distinguish these, or there may be several levels of degraded
performance that must be specified.

ISSUES: 

• What is the best way to structure the various functions? Do we want to structure as normal
versus abnormal operation or use some other criterion?

• What is the best way of relating nonfunctional attributes to system functions? The recommendation
in [van Schouwen 1990] is that the functionality should be focused upon an ideal
model of the system with timing, accuracy, and other non-functional characteristics controlling
the allowable tolerances to this ideal model. We believe that such an ideal system model
abstracts out too much of the requisite behavior of the system, e.g., timing and accuracy
should be intrinsic parts of the system's functional behavior.


Chapter 3: NONFUNCTIONAL REQUIREMENTS

We consider two areas of the nonfunctional requirements:

• Design constraints express "how" functional requirements must be refined.

• Additional nonfunctional constraints address other aspects of the system.


Section  3.1:  Design  Constraints 

Design constraints include descriptions of the major system subcomponents whose functionality
has been specified to a level of detail that is more appropriate to the system design, i.e., there is
some "white box" view of the functionality [Miller 198?]. These constrained subcomponents
include hardware, software, and human interaction within the system (most commonly
human-computer interaction). In a general systems setting it must be clarified which human interactions
are external to the system, and which are internal to the system.

This chapter should basically include what would be found in the system design documents
for those components, only moved to the requirements document and clearly labeled as design
constraints. The detailed design information here should not take the place of the requirements given
elsewhere in the document, but should be a refinement of those requirements. In other words, the
constraints should be removable without affecting the rest of the document.

ISSUES: 

• Why not let a design constraint replace the more abstract requirement, which the design
constraint refines? It may be tempting to take this shortcut in order to shorten the requirements
document. This is not to be recommended since it muddles an abstract view of the overall
system due to the level of detail, and it makes changes to design constraints more difficult.


Section  3.2:  Other  Nonfunctional  Requirements 

Additional nonfunctional requirements cover a broad range for general systems. We identify
three categories here as examples:

• Interface constraints: hardware interfaces, software interfaces, man-machine interfaces
(MMIs).

• Dependability constraints: safety, security, stochastic performance, deterministic performance,
reliability, availability, accuracy, maintainability.

• Physical requirements: dimensional limits, power consumption, environmental conditions
(weather, noise, radiation), climate control, etc.

As with functional requirements, care should be taken that these other non-functional
requirements do not overconstrain the specification to rule out what might be acceptable designs
and implementations. For example, fault-tolerant computer hardware should be considered as one
possible system design in achieving an overall system reliability.

ISSUES: 

• How should nonfunctional requirements be organized? The many different forms of nonfunctional
constraints in a general system differ widely in scope and interdependence.

• To what extent can various categories of nonfunctional requirements be expressed formally?
Is it useful to have at least a formal syntax or template for expressing such requirements (or
would that be too restrictive)?


Chapter  4:  SYSTEM  DEVELOPMENT  REQUIREMENTS 

In building large complex systems the customer may also require specific methods, tools, and
procedures to be followed to ensure that the system development process is under control.


Section 4.1: Required Subsets

Required subsets (e.g., phased deliverables) and variants of a system are documented here.
Appropriate information that may differ among subsets or variants may include timing requirements,
hardware requirements, physical requirements, etc.


Section  4.2:  Expected  Types  of  Changes 

Special care should go into documenting change within a system. It is important to record
fundamental assumptions (i.e., those aspects and decisions that will never change over any system
meeting the requirements) as well as changeable assumptions. These two types of assumptions
apply to all areas of the requirements specification.

Expanding upon these assumptions, there should also be rationale, as appropriate, gathered
during any analysis that preceded the establishment of the system requirements. Rationale is especially
important for system design constraints.

ISSUES:

• Should fundamental and changeable assumptions be grouped together? It is probably clearer
to separate them.

• Are fundamental unchangeable assumptions necessary? One view is that anything not explicitly
stated as changeable is implicitly unchangeable. We disagree with this view since implicit
assumptions are likely to be ambiguous.

• How much rationale should be included here (or in companion documents)? Some rationale is
necessary as a check that the recorded assumptions are valid.


Section  4.3:  Other  System  Development  Requirements 

Other requirements in this category include life cycle concerns (e.g., testing requirements,
documentation standards), installation procedures, project management, and quality assurance
[STARTS 1987].

ISSUES: 

• As with Other Nonfunctional Requirements, it is difficult to formalize process-oriented
requirements.


Chapter  5:  GLOSSARY  OF  TERMS 

This glossary covers all terms, jargon, abbreviations, conventions, etc. that are normally used
within the general domain of the system (e.g., the avionics domain). The main purpose is to provide
the specifiers with the requisite domain terminology for communication with the customer. If
appropriate, reference to standard glossaries of domain terminology (cited in Sources of Additional
Information) may replace part of this chapter.

Additionally, special terms (bracketed by ! — ! in A-7E requirements) specific to this particular
project should be defined. These special terms provide a mechanism for consistently recording and
using names that might otherwise be confused with informal usage. Term definitions may range
from informal natural language descriptions to formal descriptions such as macros, mnemonics for
system parameters, or abbreviations for complex mode expressions. In any case a term is useful for
hiding details of a concept, i.e., separating the concern of where and how that concept is used
versus its detailed definition.

The  terms  defined  here  should  be  used  consistently  throughout  the  rest  of  the  document  to 
aid  in  customer  understanding  as  well  as  conciseness.  For  example,  relevant  terms  should  use  the 
same  acronyms  as  those  defined  for  the  domain. 

ISSUES: 

•  What  is  the  best  way  to  organize  the  different  sorts  of  items  in  the  glossary? 


Chapter 6: SOURCES OF ADDITIONAL INFORMATION

This  section  gives  references  to  all  relevant  publications  related  to  the  system  specification 
(e.g.,  computer  manuals,  appropriate  standards,  detailed  hardware  descriptions).  It  also  records 
names,  addresses,  phone  numbers,  etc.  of  all  personnel  involved  as  either  customers  or  systems 
requirements specifiers indicating their role or expertise related to the requirements specification.


Abstract

The development and maintenance of real-time systems has become more
and more important. However, the tool support for reuse of real-time components
is far less elaborate and conceptually well defined than it needs to be.
We present an approach that combines existing technology in software
reuse, formal specification, real-time schedulability analysis, and advanced
matching techniques into a comprehensive framework for reuse of real-time software.
The basic concept is the reuse of software components which are stored
in a highly structured library system. A typical component exports a type and
operations to be used on variables of that type. Components/modules may
be generic (i.e., parameterized by types, by operations, or even by other modules).
To be used, modules must be instantiated. This structural element of
the Component Manager is based on experience with two prototypes, the Software
Archive [11, 9, 12] and the RESOLVE specification language [5, 2, 4]. To
support the formal specification of real-time software, RESOLVE is extended
with special constructs to express information regarding timing, periodicity,
etc. Within this context, we consider a mix of user-guided and automated
retrieval/classification achieved by structural support and formal specification.
The results of this "local" matching effort are used to conduct an evaluation
("global" matching) by running an analysis of how the target system will react,
with regard to its timing properties, to a reuse of the found component.
This paper gives an overview of the static structure used to specify the Component
Manager. This part of our work is based on the Software Archive concepts.
It also discusses extended RESOLVE, a language for the specification of
reusable real-time components and systems. Additionally, it presents a basic
algorithm for (automated) retrieval of components in the Component Manager
and presents the concepts of global matching with existing systems.


keywords: software reuse, component libraries, formal specification for real-time,
schedulability analysis, global and local matching


1 Introduction


Software reuse is one of the main factors in increasing software productivity.
However, while we can see a wide variety of proven concepts and even a few
successful  practical  applications  in  the  area  of  business  and  information  systems 
[19],  we  still  have  problems  in  application  domains  where  performance,  timing 
behavior,  and  other  non-functional  attributes  are  essential. 

Our goal is to develop a reuse support tool that addresses these problems,
especially the needs and requirements of real-time systems development. This
paper presents the conceptual framework for this approach, discusses existing
prototypes and techniques, and integrates them into a comprehensive tool
environment, the Component Manager.

There are four main elements in this framework that characterize our approach:

•  A  highly  structured  library  of  components. 

•  A  hybrid  mix  of  interactive  and  automated  retrieval. 

• Special support for formal specification of real-time characteristics.

•  An  evaluation  of  impacts  on  the  target  system. 

The Component Manager is based on the well-known and successful concept
of a library of components. However, this library is not flat, providing an
unstructured set of component descriptions, but relies heavily on an explicitly
defined classification structure. This structure is not only visible to the user but
is also specified and maintained by the user. This static structure is the basis for
all other elements of the Component Manager and allows one to mix relatively
informal concepts with formal specifications.

Using both aspects, informal and formal elements, retrieval and classification
utilize the structure to provide the user with a highly flexible interface. It is
open to the users to decide if they want to use the system in a user-driven,
interactive browser-mode, or if they want to hand over to automated matching
algorithms that take a formal specification as input and match it to a formal
specification in the library. What is common to both alternatives is that they
rely on the predefined static structure to narrow the search space instead of
trying to match with all components in the library.

By mixing the informal and the formal model for reuse libraries [11], we
achieve a situation where the user can guide the system interactively by navigating
in the structure of the Component Manager without losing the advantage
of machine support to handle bulk data. This hybrid approach allows
better fine-tuning and guidance of the system functionality and is a possible
solution to the problem of inflexible algorithms that are targeted only towards
one aspect of the component description and cannot handle other attributes,
e.g., timing specifications.

To support this type of system, every component description must have a
formal  and  an  informal  part.  The  informal  part  is  mainly  used  during  the 
browsing  process  and  takes  the  form  of  a  multipurpose  component  description 
that  can  be  used  in  different  classification  structures  (see  section  2).  The  formal 
part  supports  the  automated  retrieval,  a  task  that  can  not  be  fulfilled  by  the 
informal,  free  format  elements.  The  formal  specification  also  includes  special 
constructs  to  define  the  real-time  aspects  of  the  component. 

These real-time characteristics are not only used to decide if a component
fits the needs of a user by comparing component description and needs statement,
they are also the major input to evaluate the found component(s) as part
of the target system. As opposed to the (local) matching of component characteristics
during retrieval in the library, this step recalculates the behavior of
the whole system whose execution is influenced by timing and resource needs
of the currently evaluated component. This global matching can be seen as
an additional evaluation step which, if necessary, reorders the list of "locally"
matching components.

Section 2 gives an overview of the static structure used to specify the Component
Manager. This part of our work is based on the Software Archive concepts.
Section 3 describes extended RESOLVE, a language for the specification
of reusable real-time components and systems, augmenting the informal concept
of the Component Manager and supporting automated retrieval and classification.
Section 4 presents the basic algorithm for retrieval of components in the
Component Manager and discusses aspects of the local matching algorithm. The
process of global matching, on the basis of incremental schedulability analysis,
and its relation to local matching is outlined in section 5.

2 The Basic Structure of the Component Manager

The structure of the Component Manager is derived from a generalized view of
the software development process [9], and achieves independence from certain
applications or projects. Each component is described as a unique entity and
linked to other components in the library by application independent relationships
[10, 12].

A  component  is  not  restricted  to  a  representation  on  the  source  code  level. 
It  may  have  any  form  of  representation  ranging  from  specification  to  source 
code.  Furthermore,  potentially  reusable  components  are  classified  according  to 
their level of decomposition, e.g., as systems or subsystems. Similarity between
components  is  described  by  using  a  generalization  hierarchy  with  strict  attribute 
inheritance. 

A component is a part of an existing system; it is reviewed for possible
reuse and incorporated into the Component Manager by classifying it according
to its structural relations to the other modules in the repository. A module
description, as far as it will be described here, consists of the following parts:

•  Contents  definition. 

•  Interface  and  placement/positioning  information. 

The  contents  definition  includes  the  original  description  of  the  module,  as 
derived  during  its  development  process.  It  is  the  basis  to  define  attributes 
which  are  independent  from  a  notational  form  used  in  a  specific  project.  These 
attributes,  classifying  the  module,  are  also  stored  in  the  contents  definition  part 
of  the  module  description.  They  are  of  superior  importance  for  determining 
the  location  of  the  module  in  the  repository  structure  and  for  manual  retrieval 
operations. 

A standardized and generic specification of the component on the level of
"concepts" is used to define a normalized component description. These concept
level descriptions support easier understanding for the user and can be used
during the automated retrieval operations. Related to such a concept description,
different realizations can be stored, providing design and implementation
alternatives.

The interface and placement description contains all information necessary to
determine the place of the module in the repository's structure, thus capturing
its relations to the other components. To support this goal, the global structure
of the Component Manager includes as its main elements a decomposition
dimension and a generalization dimension [10]. The decomposition dimension
classifies a module according to its level of system aggregation and is implemented
by PART-OF relations. The generalization dimension models similarity
between different modules (on one level of decomposition). It is implemented
by an ISA relation.

The structure of the Component Manager, with its main factors
aggregation/decomposition and generalization/specialization, can be seen as shown in
figure 1. There exist different levels of decomposition. Each module is assigned
to exactly one of these levels (and may be linked to modules on other levels
by means of a PART-OF relation, not shown in the diagram). On each level of
decomposition the modules are structured according to their level of abstraction
in generalization hierarchies.

The ISA relation is defined in terms proposed in [1]. It is a relatively static
form of ISA relation, including strict inheritance of attributes from ancestors
but excluding multiple inheritance. The attributes inherited are the classifying
attributes placed in the contents definition of the module. The attributes of all
related modules on higher levels of generalization are inherited and completed
by those derived from the current module's contents definition.
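
The two dimensions can be illustrated with a minimal Python sketch. It is not the Component Manager's or the Software Archive's data model; the class name Component, the field names, and the example components are hypothetical. One further assumption is labeled in the code: "strict" inheritance is read here as meaning that an inherited classifying attribute cannot be overridden.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Component:
    """Hypothetical library entry with PART-OF and ISA links."""
    name: str
    level: str  # each module belongs to exactly one level of decomposition
    own_attributes: Dict[str, str] = field(default_factory=dict)
    isa_parent: Optional["Component"] = None  # one parent only: no multiple inheritance
    parts: List["Component"] = field(default_factory=list)  # PART-OF children

    def __post_init__(self) -> None:
        # ISA models similarity between modules on one level of decomposition.
        if self.isa_parent is not None and self.isa_parent.level != self.level:
            raise ValueError("ISA links must stay within one level of decomposition")

    def attributes(self) -> Dict[str, str]:
        """Inherited classifying attributes completed by the module's own ones."""
        inherited = self.isa_parent.attributes() if self.isa_parent else {}
        # Assumption: strict inheritance means inherited attributes are never overridden.
        clash = inherited.keys() & self.own_attributes.keys()
        if clash:
            raise ValueError(f"{self.name} overrides inherited attributes: {sorted(clash)}")
        return {**inherited, **self.own_attributes}


# Illustrative entries (all names are assumptions).
container = Component("Container", "module", {"kind": "data structure"})
queue = Component("Queue", "module", {"discipline": "FIFO"}, isa_parent=container)
scheduler = Component("Scheduler", "subsystem", parts=[queue])
```

Here queue.attributes() combines the attribute inherited from Container with the one defined by Queue itself, while Scheduler sits one level higher and reaches Queue only through a PART-OF link.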

Different points of view concerning similarity between components and/or
different ways to structure the search space can be expressed by the user-view
concept [12]. This concept of varying views for different groups of users on
the same conceptual structure is comparable to the user-view concept used in
database applications.

Figure 1: The basic structure of the Component Manager

A user-view consists of components linked by ISA relationships, thus spanning
the two dimensions of generalization and similarity-association (see figure 1).
Using generalization and attribute inheritance, a class hierarchy of
(similar) components is constructed. All components in a user-view must be
classified to be at the same level of system decomposition.

If  we  take  into  account  that  the  notion  of  similarity  is  a  semantic  concept, 
which  is  determined  by  the  point  of  view  of  the  user,  we  have  to  recognize  that 
it will not be adequate to enforce the use of only one such classification scheme
on  a  given  level  of  decomposition.  Doing  so  would  make  it  difficult  or  even 
impossible  for  the  user  to  structure  the  set  of  components  according  to  his/her 
needs. 

Therefore, to support modeling of alternative generalizations and similarity
classifications, the Component Manager allows the user to build different
parallel user-views on one level of decomposition. They provide the means to
define different classification structures, reflecting the different points of view of
different user groups (such as projects or departments).

This  means  that  a  component  is  classified  to  belong  to  exactly  one  level  of 
decomposition,  but  may  be  used  in  different  parallel  user-views.  The  user-views 
on  one  level  of  decomposition  represent  the  set  of  all  specified  classifications  for 
the  components  on  that  level. 

The main concept of the reuse support structure, as presented so far, is to
divide the search space (the set of all stored components) into user-views and
levels of decomposition. This structure enables a user to retrieve components
by browsing through the system by hand or by invoking an automated search
support algorithm. While the user can utilize most of the (rather informal)
information stored for each component, the algorithm will mainly rely on the
standardized formal specification which is part of the component's contents
definition.
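
How user-views narrow the search space can be sketched as follows. This is only an illustration under stated assumptions: the class names (Entry, Library), the view names, and the component names are hypothetical, and the subset test on sets of operation names is a stand-in for matching formal specifications, not the retrieval algorithm of the Component Manager.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Entry:
    """Hypothetical stored component."""
    name: str
    level: str                 # exactly one level of decomposition
    provides: FrozenSet[str]   # stand-in for the formal specification


class Library:
    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}
        self._views: Dict[str, List[str]] = {}

    def add(self, entry: Entry) -> None:
        self._entries[entry.name] = entry

    def define_view(self, view: str, names: List[str]) -> None:
        """A user-view may only contain components of one decomposition level."""
        levels = {self._entries[n].level for n in names}
        if len(levels) > 1:
            raise ValueError("a user-view must stay on one level of decomposition")
        self._views[view] = list(names)

    def search(self, view: str, needed: FrozenSet[str]) -> List[str]:
        """Match only against the components of the chosen view, not the whole library."""
        candidates = (self._entries[n] for n in self._views[view])
        return [e.name for e in candidates if needed <= e.provides]


def demo_library() -> Library:
    """Build an illustrative library (all names are assumptions)."""
    lib = Library()
    lib.add(Entry("Queue", "module", frozenset({"enqueue", "dequeue"})))
    lib.add(Entry("Stack", "module", frozenset({"push", "pop"})))
    lib.add(Entry("Timer", "module", frozenset({"start", "stop"})))
    # The same component may appear in several parallel user-views.
    lib.define_view("project_a", ["Queue", "Stack"])
    lib.define_view("project_b", ["Queue", "Timer"])
    return lib
```

A query against "project_a" never inspects Timer, which mirrors the idea that both browsing and automated matching start from a view rather than from the set of all stored components.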


3  The  Specification  Language 

3.1  Basics  of  RESOLVE 

Each specification is called a concept and may have multiple implementations
(called realizations). A typical module exports a type and operations to be used
on variables of that type. Modules may be generic (i.e., parameterized by types,
by functions or operations, or even by other modules). To be used, a module
is instantiated by fixing its parameters and choosing one realization. We refer
to a module instance as a facility. Each module has an initialization operation
that executes when a facility is created, and initializes any state the facility may
have. Type initialization and finalization operations must be provided by the
author of a module for each type that the module exports. These operations
are required to simplify proofs of correctness of implementations and so that
each variable can be automatically initialized (or finalized) upon entry to (exit
from) an operation that declares it.

In our work, reusable software components are specified using RESOLVE
[4, 20, 5, 2], which is an acronym for Reusable Software Language with Verifiability
and Efficiency. In RESOLVE, every program type is modeled using a
standard mathematical type and operations on the program type are explained
using notations from its mathematical type. Specifications are written using
predicate calculus and well known mathematical theories such as integer theory,
real number theory, boolean algebra, string theory, and function theory.
For example, the type List can be modeled using an ordered mathematical pair
of mathematical strings, as shown below.

concept List_Template (type Item)
    type List is modeled by
        <left: String (Item),
         right: String (Item)>
        initially, forall L: List, L.left = Lambda and L.right = Lambda

    procedure Reset(alters L: List)
        ensures L.left = Lambda and L.right = #L.left o #L.right

    procedure Advance(alters L: List)
        requires L.right /= Lambda
        ensures thereExists Y: Item, s.t.,
            L.left = #L.left o Y and #L.right = Y o L.right

    function At_Right_End(preserves L: List) returns Boolean
        ensures At_Right_End iff L.right = Lambda

    procedure Insert(alters L: List; consumes X: Item)
        ensures L.left = #L.left and L.right = #X o #L.right

    procedure Remove(alters L: List; produces X: Item)
        requires L.right /= Lambda
        ensures L.left = #L.left and #L.right = X o L.right

    procedure Swap_Right(alters L1, L2: List)
        ensures L1.left = #L1.left and L1.right = #L2.right and
                L2.left = #L2.left and L2.right = #L1.right
end List_Template

Different  mathematical  theories  can  be  used  to  reason  about  the  behavior 
and  abstract  structure  of  concepts.  In  each  concept,  the  mathematical  model  of 
the type is stated and the initial value for variables of the type is given. The
interfaces of the operations and descriptions of their behaviors follow the type
description. Each operation has an optional requires clause, a pre-condition that
must be satisfied before the operation is invoked. Each operation also has
an ensures clause describing the effect of the operation when the requires clause
is  satisfied.  Ensures  clauses  refer  to  the  old  values  of  parameters  by  preceding 
their  names  with  a  pound  sign  (#). 

The type List can be viewed as an ordered mathematical pair of mathematical
strings (as in the List_Template concept above). Under this model, the list is
treated as two segments, left and right. When a client of the List_Template
declares a list variable, it is given the initial value described in the
initialization section: an empty list. An imaginary cursor points to the
position in the list between the left and right portions. Thus, the Insert
and Remove operations affect the leftmost item on the right half of the list. The
Insert operation is described as the concatenation of an item onto the front of
the right string. Similarly, the Remove operation deletes the leftmost item of
the right string. Advance moves the leftmost item of the right string into the
rightmost position of the left string. Reset moves the cursor to the left of the
entire list and the operation At_Right_End returns true if and only if the right
string is empty. The Swap_Right operation exchanges the right portions of two
different lists, permitting implementations that provide efficient swapping of
lists.
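
As an informal illustration of the model just described, the following Python sketch mirrors the pair-of-strings view of a list. It is not RESOLVE and not the authors' realization; the class name ListModel and its method names are hypothetical, and Python lists stand in for mathematical strings. Preconditions are checked with assertions, and each method body is written to satisfy the corresponding ensures clause.

```python
from dataclasses import dataclass, field
from typing import Any, List as PyList


@dataclass
class ListModel:
    """Hypothetical executable model: a list is a pair of strings (left, right)."""
    left: PyList[Any] = field(default_factory=list)   # initially Lambda (empty)
    right: PyList[Any] = field(default_factory=list)  # initially Lambda (empty)

    def reset(self) -> None:
        # ensures L.left = Lambda and L.right = #L.left o #L.right
        self.right = self.left + self.right
        self.left = []

    def advance(self) -> None:
        # requires L.right /= Lambda
        assert self.right, "Advance requires a non-empty right string"
        # ensures L.left = #L.left o Y and #L.right = Y o L.right
        self.left.append(self.right.pop(0))

    def at_right_end(self) -> bool:
        # ensures At_Right_End iff L.right = Lambda
        return not self.right

    def insert(self, x: Any) -> None:
        # ensures L.left = #L.left and L.right = #X o #L.right
        self.right.insert(0, x)

    def remove(self) -> Any:
        # requires L.right /= Lambda
        assert self.right, "Remove requires a non-empty right string"
        # ensures L.left = #L.left and #L.right = X o L.right
        return self.right.pop(0)

    def swap_right(self, other: "ListModel") -> None:
        # ensures the right strings of the two lists are exchanged
        self.right, other.right = other.right, self.right
```

A client would declare a variable with ListModel(), which starts as the empty list, and then move the cursor with advance and reset.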

The  RESOLVE  specification  language  provides  additional  features  that  did 
not appear in the specification of the List_Template. Items that can be declared
are  mathematical  constants,  variables  and  functions.  These  are  used  only  for 
stating  the  conceptual  view  of  a  module,  but  typically  have  counterparts  in 
realizations  of  the  module.  Additional  information  that  can  be  provided  includes 
constraints  on  behavior  of  the  module,  lemmas  (to  be  used  in  understanding  the 
module and to simplify proofs of correctness), and the initial state of the module.
These items are stated as assertions using predicate calculus. The definitions
of provided types (in the interface section) can contain such information as
constraints on states of variables having the type, lemmas, and initial and final
values. Parameters to a concept can be types (as seen in the List_Template),
mathematical  functions,  or  other  concepts.  When  a  concept  is  passed  as  a 
parameter,  any  operations  and  types  that  it  provides,  and  anything  defined 
in  its  auxiliary  section,  can  be  referenced.  To  reference  operations  and  types 
provided  by  a  concept  parameter,  they  must  be  fully  qualified.  Additionally, 
a  concept  can  access  anything  passed  as  a  parameter  to  a  concept  that  is  a 
parameter  to  it.  For  example,  if  concept  li  i.s  a  parameter  to  concept  g,  and  g 
is  a  parameter  to  concept  /,  then  assertions  in  g  can  reference  definitions  from 
h;  and  assertions  in  /  can  reference  definitions  from  g  and  fi. 

To assure that the correct parameters are used when modules are instantiated,
restrictions can be stated in the parameters sections of modules. Restrictions
can be (1) TYPE_NAME = TYPE_NAME or (2) CONCEPT_NAME =
CONCEPT_NAME.
These restrictions enable module designers to prevent incorrect usage and
permit static checks to be performed. For example, if a module requires two
type parameters, the module designer is permitted to state the constraint that
the types must be the same.

To provide hierarchies of components, concepts can enhance other concepts.
An enhancement "inherits" the definitions, types, and operations provided by
the enhanced concept. Additional operations and types can be provided by
the enhancement. For example, below we specify and implement an operation
that reverses a list. The reverse operation is secondary, meaning that it is
implemented in terms of operations provided by the List_Template.

concept List_Reverse_Template enhances List_Template

    procedure Reverse(alters L: List)
        ensures L.left = Lambda and L.right = (#L.left o #L.right)^R
end List_Reverse_Template
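
To make the notion of a secondary operation concrete, here is a small Python sketch (not RESOLVE, and not the realization referred to in the text) in which Reverse is built only from the primary List_Template operations. The class ListModel is a hypothetical stand-in for a realization of List_Template, trimmed to the operations the sketch needs; the approach of draining the list into a temporary one is an assumption chosen for illustration.

```python
class ListModel:
    """Hypothetical stand-in for a List_Template realization (pair of strings)."""

    def __init__(self):
        self.left = []
        self.right = []

    def reset(self):
        self.right = self.left + self.right
        self.left = []

    def advance(self):
        assert self.right
        self.left.append(self.right.pop(0))

    def at_right_end(self):
        return not self.right

    def insert(self, x):
        self.right.insert(0, x)

    def remove(self):
        assert self.right
        return self.right.pop(0)

    def swap_right(self, other):
        self.right, other.right = other.right, self.right


def reverse(lst):
    """Secondary operation: uses only Reset, At_Right_End, Remove, Insert, Swap_Right.

    ensures L.left = Lambda and L.right = (#L.left o #L.right)^R
    """
    lst.reset()                      # cursor to the far left
    temp = ListModel()               # fresh, empty list
    while not lst.at_right_end():
        temp.insert(lst.remove())    # each item lands in front of the earlier ones
    lst.swap_right(temp)             # move the reversed string back without copying
```

Because the final step is a swap rather than a copy, the sketch stays within the data movement style that the list concept is designed to support.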

RESOLVE is not just a specification language; it also permits multiple
realizations of any concept to be written. The RESOLVE implementation language
has many noteworthy features. Assignment (copying) of one variable's value to
another variable is not a part of the language; instead, swapping the values of
two variables is the only built-in data movement primitive. There are no global
variables. Instead, operations can access three kinds of data: operation
parameters; local variables; and module variables (static variables associated with a
particular module instance that are shared among operations exported by that
instance). Aliased variables cannot arise, i.e., the data structure representing
a variable's value can only be known by one name at any time. No types are
built-in; therefore, almost every statement is a call, since every manipulation
of a variable whose type is provided by a reusable component is achieved by a
call to a facility operation. Modules cannot be instantiated dynamically, i.e.,
instantiations are declarations that occur outside the code of module
operations, and all instantiations are performed when a program begins execution.
Furthermore, the types of variables are determined statically, and there are no
constructs  in  the  language  for  expressing  parallelism. 

3.2  Real-Time  Aspects  of  RESOLVE 

Languages such as RESOLVE [5] allow module developers to state the duration
interval for operations exported by the modules. While this information is
useful for reasoning about the execution time of composite modules, additional
language constructs are required to describe systems of real-time processes. We
allow system definitions to contain facility declarations, followed by global
variable declarations, and then process specifications. A process specification
contains facility declarations, variable declarations and code. Additionally, each
process description may mention periodicity, deadlines, and external events.

Periodicity refers to the frequency with which repeatedly executed processes
are performed. The syntax for denoting a periodic process is

    every x seconds perform f.operation1(a, b, c)

The period of the process is stated in seconds. The behavior of the process is
programmed as the operation of a module.

A deadline dictates the maximum time that can elapse from the start of a
process invocation until its completion. The deadline of a periodic process is
stated as:

    every x seconds perform f.operation1(a, b, c) before y seconds elapse

The event-driven process is an important part of many real-time systems.
Such a process typically has deadlines. Additionally, there is usually a minimum
amount of time that must elapse between occurrences of an event. We allow
event-driven processes to be codified as

    on event e perform f.operation1(a, b, c) before y seconds elapse
        at most every z seconds

Another construct needed is the timer. It allows system developers to insert
delays into a process's code so that activities in the controlled environment can
be performed before proceeding with the function of the process. To insert a
delay, a component developer writes "delay x seconds".

Devices are often manipulated by real-time systems. We permit this by
encapsulating each device in a module. Thus, device states may be inspected
or altered simply by calling a module operation.
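
The timing constructs of this section can be pictured as plain data. The Python sketch below is only an illustration under stated assumptions: the class names PeriodicProcess and EventDrivenProcess, their fields, and the choice to let a missing deadline default to the period are all hypothetical and are not part of the language extension described above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PeriodicProcess:
    """Models: every x seconds perform op [before y seconds elapse]."""
    operation: str                    # e.g. "f.operation1"
    period: float                     # x
    deadline: Optional[float] = None  # y; assumption: defaults to the period

    def windows(self, horizon: float) -> List[Tuple[float, float]]:
        """(release time, absolute deadline) for each invocation released before horizon."""
        relative = self.period if self.deadline is None else self.deadline
        result = []
        k = 0
        while k * self.period < horizon:
            release = k * self.period
            result.append((release, release + relative))
            k += 1
        return result


@dataclass(frozen=True)
class EventDrivenProcess:
    """Models: on event e perform op before y seconds elapse at most every z seconds."""
    operation: str
    deadline: float        # y
    min_separation: float  # z

    def too_close(self, event_times: List[float]) -> List[float]:
        """Event times that arrive sooner than min_separation after the previous event."""
        times = sorted(event_times)
        return [t for prev, t in zip(times, times[1:]) if t - prev < self.min_separation]
```

For instance, PeriodicProcess("f.operation1", 2, 1).windows(6) lists three invocations, each of which must finish within one second of its release.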

4  Retrieval  of  Components  -  Local  Matching 

4.1  The  Basic  Algorithm 

Utilizing the structural layout of the repository and the power of concepts and
realizations, the dynamic aspects of the Component Manager include all the
necessary  utilities  and  user  features  to  search  for  an  existing  module  in  the 
repository  and  to  classify  a  new  module. 

The software engineer analyzing and refining a given system component is
the typical user interacting with the tool. He searches for modules fitting his
demands, as derived during the development process, and provides the library
with new modules not included in the structure so far [9].

While searching in the Component Manager can be done by everybody who
needs an existing system component, it is desirable to allow only a specially
trained person to incorporate new components. This role of a "component
administrator" can help to keep the system consistent and can avoid violating
predefined integrity or organizational rules.

Searching for and classifying components can be seen very much as the same
kind of process, only starting at different ends. As the discussion of both
processes would go beyond the scope of this paper, only the retrieval is presented
in some detail as described in figure 2. (The algorithm presented here is an
adapted version of an algorithm presented in [8].)

Starting from given demands, i.e. the specification of a component, the first
step is to determine on which level of decomposition the component is most
likely to be situated. This level of decomposition can be derived from the level
of decomposition the development process is currently working on.

At the determined level of decomposition exist multiple user-views. After
picking one of them, perhaps based on a description of the classification strategy
used in the user-view, a first approximation of the component searched for has
to be located. To do so, look for one of the very general components on top
of the generalization hierarchy and pick the one which is most similar to the
specification describing the demands. If this component is not a perfect match,
a recursive subtree search can start that takes this module as its starting point.
This search process will lead to more specialized levels of generalization with
more attributes and more information.

The  search  is  oriented  towards  the  ISA  relations  of  the  structure  and  follows 
a path leading to refined but similar modules (as defined by the ISA-relations).
This allows appropriate reasoning about different possible solutions [3] and takes
advantage of the structured search space [7]. The quality of the results obtained
and the ease of use of the tool rely to a certain degree on the correspondence
between  the  structure  of  the  development  process  and  the  structure  of  the  Com¬ 
ponent  Manager. 

Many of the functions incorporated in the search process described above
can be optimized if they are interactively guided by the user. For example,
determining whether a given module matches is best done by the user, who
knows exactly his current needs and the required degree of similarity. However,
in principle all of the steps described in the algorithm can be done automatically.
If automated support is asked for, the component specification used as a goal
description must be a concept (or a realization) described in RESOLVE, to be
able  to  match  it  to  the  RESOLVE  concepts  in  the  descriptions  of  components 
in  the  repository. 

The  Component  Manager  merges  both  modes,  thus  providing  the  user  with 
a hybrid environment. It is up to the user to decide what to do by hand and what
to  have  done  by  the  machine.  Therefore,  both  extreme  cases  -  pure  manual  and 
pure  automatic  retrieval  -  are  still  possible. 

4.2  Automated  Local  Matching  of  Formal  Specifications 

Automated local matching will allow a user to invoke, whenever necessary or
convenient, the automated version of these supporting algorithms providing
PROC RETRIEVE-COMPONENT (LIBRARY, COMPONENT-SPEC).
  FOUND := NO-MATCH.
  DETERMINE DECOMPOSITION-LEVEL
  REPEAT
    DETERMINE USER-VIEW
    REPEAT
      FIND-SIMILAR (COMPONENT-SPEC, STARTING-POINT)
      REPEAT
        EVALUATE (COMPONENT-SPEC, STARTING-POINT)
        IF (FOUND ≠ MATCH) THEN
          SUBTREE-SEARCH (COMPONENT-SPEC, STARTING-POINT)
        FI
      UNTIL (FOUND = MATCH) OR
            (NO MORE ALTERNATIVE STARTING-POINTS)
    UNTIL (FOUND = MATCH) OR
          (NO MORE ALTERNATIVE USER-VIEWS).
  UNTIL (FOUND = MATCH) OR
        (NO MORE ALTERNATIVE DECOMPOSITION-LEVELS)
CORP RETRIEVE-COMPONENT

PROC SUBTREE-SEARCH (CURRENT-SPEC, ANCESTOR).
  WHILE (FOUND ≠ MATCH) AND
        (EXISTS UNEVALUATED DEPENDENT-COMPONENT)
  DO
    EVALUATE (CURRENT-SPEC, DEPENDENT-COMPONENT)
  OD.
  IF (FOUND ≠ MATCH) THEN
    WHILE (FOUND ≠ MATCH) AND
          (EXISTS DEPENDENT-MODULE NOT USED AS NEW-ANCESTOR)
    DO
      FIND-SIMILAR (CURRENT-SPEC, NEW-ANCESTOR)
      SUBTREE-SEARCH (CURRENT-SPEC, NEW-ANCESTOR)
    OD.
  FI.
CORP SUBTREE-SEARCH.


Figure 2: Searching in the Component Manager
him/her either with the "second opinion of another expert" or with the help of
a "clerk" who is taking over in a well defined situation.

We expect a retrieval style that will utilize especially the automated versions
of EVALUATE (decide if a match is given), FIND-SIMILAR (decide on the
level of similarity), and SUBTREE-SEARCH (evaluate all components in a
given subtree of a user-view's generalization hierarchy); see figure 2. All these
processes are, from the point of view of their control structure, straightforward
list or backtracking algorithms. The critical point is to decide if and to what
degree  a  match  is  given. 
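
The control structure of figure 2 can be sketched in Python as follows. This is a simplified illustration, not the Component Manager's code: it covers a single user-view only (the outer loops over decomposition levels and user-views are omitted), and the Component class, the feature-set representation, and the toy evaluate and similarity functions are assumptions standing in for the real matching of RESOLVE specifications.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional


@dataclass
class Component:
    """Hypothetical node in a user-view's generalization hierarchy."""
    name: str
    features: FrozenSet[str]
    children: List["Component"] = field(default_factory=list)


def evaluate(spec: FrozenSet[str], comp: Component) -> bool:
    """Toy EVALUATE: a match means the feature sets are identical."""
    return spec == comp.features


def similarity(spec: FrozenSet[str], comp: Component) -> int:
    """Toy FIND-SIMILAR score: number of shared features."""
    return len(spec & comp.features)


def subtree_search(spec: FrozenSet[str], ancestor: Component) -> Optional[Component]:
    # First loop of SUBTREE-SEARCH: evaluate every dependent component.
    for child in ancestor.children:
        if evaluate(spec, child):
            return child
    # Second loop: descend, trying the most similar dependent module first.
    ranked = sorted(ancestor.children, key=lambda c: similarity(spec, c), reverse=True)
    for new_ancestor in ranked:
        found = subtree_search(spec, new_ancestor)
        if found is not None:
            return found
    return None  # backtrack: nothing in this subtree matches


def retrieve(spec: FrozenSet[str], roots: List[Component]) -> Optional[Component]:
    """RETRIEVE-COMPONENT restricted to one user-view with the given top-level components."""
    for start in sorted(roots, key=lambda c: similarity(spec, c), reverse=True):
        if evaluate(spec, start):
            return start
        found = subtree_search(spec, start)
        if found is not None:
            return found
    return None
```

In an interactive session, the user could replace evaluate or similarity with a manual decision at any step, which corresponds to the hybrid mode described above.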

To decide if a component is a (local) match means to compare the needs
statement with the component description. In the case of an automated
retrieval, the needs statement is a concept or realization as described for
RESOLVE. Thus, the comparison is done on the basis of the formal (RESOLVE)
specifications stored in the component manager.

The user of the component manager supplies a specification of the module
that is required from the library. The requirement specification does not need
to be a complete specification, but the likelihood of finding the desired
component increases with the degree of completeness of the specification. When
the requirement specification is matched against a component in the library, the
user selects the realization of the component in the library that suits his needs.

Matching one specification against another is in general an unsolvable
problem. However, in a great many cases the problem can be solved, at least
partially. A grammar for the syntax of concepts has been defined, permitting the
compilation of specifications. The assertions are machine processable if the
mathematical theories (their notations and operations) are formally defined,
and if a base set of logic rules is hard-wired into the processor.

Many specifications describe abstract data type modules. To match two
specifications, one can check whether the mathematical models of the two are
the same. The number and kinds of parameters to the modules provide other
features for comparison. The number of operations can also be compared, as
can the interfaces, and the pre-conditions and post-conditions. One approach
to making the comparisons is to translate the assertions about the operations'
pre-conditions and post-conditions to assertions about a canonical
mathematical model, such as set theory. This has the advantage that if a requirements
specifier chooses a different model for a data type than the specifier of the
library component chose for the same component, matching is simplified.
However, translation to a canonical form may increase the number of terms in each
assertion, thus increasing the cost of comparison.
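
A rough picture of the syntactic part of such a comparison is given below in Python. It is a sketch under stated assumptions, not the matching procedure of the Component Manager: the ConceptSummary record, the pairing of operations by name, and the equal weighting of all checks are hypothetical choices, and the comparison of pre-conditions and post-conditions is left out entirely.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ConceptSummary:
    """Hypothetical summary of the comparable features of a concept."""
    model: str                              # e.g. "pair(string, string)"
    num_params: int                         # number of concept parameters
    operations: Dict[str, Tuple[str, ...]]  # operation name -> parameter modes


def match_score(required: ConceptSummary, candidate: ConceptSummary) -> float:
    """Fraction of the required features that the candidate satisfies.

    Only operations named in the requirement are checked, so a partial
    requirement specification can still be matched.
    """
    checks = [
        required.model == candidate.model,
        required.num_params == candidate.num_params,
    ]
    for name, modes in required.operations.items():
        checks.append(candidate.operations.get(name) == modes)
    return sum(checks) / len(checks)
```

A score of 1.0 would correspond to a local match at the interface level; lower scores could feed the similarity ranking used during the subtree search.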
5  Evaluation of Components - Global Matching

While matching a non-real-time component specification can be done in a
local sense, at the interface level, matching a real-time component specification
requires more work, since introducing such a component has indirect
timing effects on the rest of the real-time application. Predicting real-time
performance before the application is actually running is referred to as
schedulability analysis [14, 16]. Even when done for programs written in languages that
conform  to  time-constrained  real-time  languages,  exact  schedulability  analysis 
is  NP-complete.  Essentially,  existing  accurate  algorithms  are  exponential  in  the 
number  of  alternate  conditional  branches  found  in  real-time  programs  and,  in 
the  case  of  parallel  real-time  systems,  in  the  number  of  PEs  and  software  com¬ 
ponents  in  consideration.  Furthermore,  even  polynomial-time  algorithms  may 
still  be  computationally-prohibitive,  when  the  number  of  PEs  and  components 
is  very  large,  which  is  often  the  case  with  many  modern  real-time  systems  (such 
as those in the C3I domain of applications, for example).

We  make  the  assumption  that  the  code  of  components  is  amenable  to  static 
analysis, as in Real-Time Euclid [6]. We assume that loops have been
unrolled, that no recursion is used, and that conditionals have been balanced and
transformed [17, 18] to reduce the number of alternate paths that
schedulability analysis [14, 16] has to consider. We also make the assumption that the
call-DAG  of  components  is  statically-known.  We  require  that  the  sizes  of  all 
variables  and  object  states  be  statically  determinable.  In  particular,  we  employ 
standard  techniques  used  in  RPC  and  distributed  system  implementations  (such 
as in SUPRA-RPC [15], among others), to compute the size of each operation
parameter. The direction of each parameter (IN, OUT or INOUT) is either
available from the language definition or is provided as a remote call annotation.
Consequently,  each  component  specification  contains  sufficient  information  that 
describes  how  much  time  each  operation  of  the  component  takes  to  execute, 
what  other  operations  of  what  components  each  operation  of  the  component 
calls  and  how  many  times,  and  how  much  data  needs  to  be  transmitted  in  each 
direction  on  each  call. 

Given  the  interface  description  for  the  component  being  considered  for  reuse, 
a  performance  estimate  of  the  entire  system  is  undertaken  in  three  steps.  First, 
the demand for each resource (PE, link, sensor and so forth) in the system is
projected. For every resource at the node where the component will reside, this
demand is computed according to polynomial-time heuristics, which project
accurate accumulated execution or communication demand due to every assigned
component within a certain interval of time (such as the least-common-multiple
of the periods of all real-time processes using the component) and estimate such
demand due to the components that have not yet been assigned. (Note that
even these are heuristics in the sense that the demands are accurate in the
accumulated sense only and not necessarily at any particular instance of time.)
Should there be sufficient time, the same heuristics are applied to nodes and
links neighboring this node. In systems with a large number of nodes, the
corresponding demands (with the contribution of the component in consideration)
are only estimated in the sense that individual nodes are not considered but
groups of nodes are, as “mega-nodes” (one such mega-node includes the node
of  the  component  and  its  immediate  neighbors).  Should  there  be  a  very  large 
number  of  nodes,  then  the  mega-nodes  are  combined  into  “mega-mega-nodes” 
and  so  forth. 

Once  accumulated  demands  for  every  resource  have  been  estimated,  they 
are  easily  converted  into  utilizations  (by  dividing  over  the  size  of  the  same  time 
interval  over  which  they  have  been  accumulated).  Should  any  utilization  exceed 
100%, the component is rejected as too time-consuming for the system.
Otherwise, the utilizations are in turn used to compute progress rates incurred by
individual or groups of processes when attempting to use a particular resource.
Each rate is estimated as an expected value, where the expected probabilities
are the probabilities that (1) no request is made for the resource, (2) this process
is the only one making a request, and (3) other processes made their requests
when this component's request has come. The expected values for each
probability are, respectively, 100% (rate of progress), 100% and the fraction of this
process's  contribution  to  the  total  demand  for  the  resource. 
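
The arithmetic of the first two steps can be sketched in Python. The function names and the way the three probabilities are passed in are assumptions made for illustration; the text does not say how those probabilities are derived from the utilizations, so the sketch takes them as inputs and assumes only that they sum to one.

```python
from functools import reduce
from math import gcd
from typing import List


def accumulation_interval(periods: List[int]) -> int:
    """Least common multiple of the process periods (the accumulation interval)."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)


def utilization(accumulated_demand: float, interval: float) -> float:
    """Demand accumulated over the interval, divided by the interval length."""
    return accumulated_demand / interval


def is_rejected(utilizations: List[float]) -> bool:
    """A component is rejected if any resource would be more than 100% utilized."""
    return any(u > 1.0 for u in utilizations)


def progress_rate(p_no_request: float, p_only_request: float,
                  p_contended: float, demand_share: float) -> float:
    """Expected rate of progress of a process on one resource.

    The first two cases progress at 100%; under contention the process
    progresses at its share of the total demand for the resource.
    Assumption: the three probabilities are supplied by the caller.
    """
    return p_no_request * 1.0 + p_only_request * 1.0 + p_contended * demand_share
```

For example, periods of 4, 6 and 10 time units give an interval of 60, and a resource with 45 units of accumulated demand over that interval is 75% utilized.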

Finally,  response  times  of  each  process  or  a  group  of  processes  are  computed 
as  sums  of  ratios  of  each  process’s  contribution  to  the  total  demand  for  a  resource 
over  the  rate  of  progress  of  the  process  for  this  resource,  for  every  resource.  The 
response  times  are  contrasted  with  the  corresponding  process  periods.  Should 
a response time exceed a period, the performance prediction has identified a
potential missed deadline, and the component is rejected as too time-consuming.
Should  there  be  multiple  components  chosen  by  the  same  local  match,  the  one 
which  maximizes  the  laxities  (computed  as  sums  of  differences  between  periods 
and  projected  response  times)  in  the  system  is  chosen. 
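
The third step and the final selection can be illustrated with a short Python sketch. The data layout is an assumption: each process is given as its period plus a list of (demand contribution, progress rate) pairs, one per resource, and each candidate component maps to the processes of the system as they would behave with that candidate installed.

```python
from typing import Dict, List, Optional, Tuple

Usage = List[Tuple[float, float]]   # (demand contribution, progress rate) per resource
Process = Tuple[float, Usage]       # (period, usage)


def response_time(usage: Usage) -> float:
    """Sum, over all resources, of demand contribution divided by progress rate."""
    return sum(demand / rate for demand, rate in usage)


def total_laxity(processes: List[Process]) -> Optional[float]:
    """Sum of (period - response time); None if any process would miss its period."""
    laxity = 0.0
    for period, usage in processes:
        rt = response_time(usage)
        if rt > period:
            return None  # potential missed deadline: reject the candidate
        laxity += period - rt
    return laxity


def choose_component(candidates: Dict[str, List[Process]]) -> Optional[str]:
    """Among locally matching candidates, pick the one with the largest total laxity."""
    best_name, best_laxity = None, None
    for name, processes in candidates.items():
        laxity = total_laxity(processes)
        if laxity is not None and (best_laxity is None or laxity > best_laxity):
            best_name, best_laxity = name, laxity
    return best_name
```

A candidate that causes any projected response time to exceed its period is dropped before laxities are compared, which mirrors the rejection rule stated above.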

To  further  speed  up  the  performance  estimation,  demands,  utilizations,  rates 
of progress and response times are computed incrementally. Typically, while
relatively more work is needed to update the values of these metrics at the PE
where the component is to reside, little extra work is needed for the mega- or the
mega-mega- etcetera nodes. Furthermore, in systems where reusable component
selection is combined with the selection of the node to assign the component to,
the performance estimate incorporates a fast assignment algorithm that
considers a fixed number of nodes from among the least utilized ones. The overall
performance estimating procedure runs in polynomial time, and is expected
to provide good predictions of run-time real-time performance. A quantitative
evaluation of the procedure is in progress.
6  Conclusions


Exploiting the synergy between component library management techniques (the
Software Archive [11, 9, 12]) and formal component specification (the RESOLVE
specification language [5, 2, 4]), the Component Manager strives to realize a
hybrid tool supporting fully automated, partially automated, and manual retrieval
of real-time components. Going beyond the level of single components, a
possible match is evaluated with regard to its influence on the timing properties of
the  target  system,  using  the  concepts  of  schedulability  analysis. 

Ongoing research includes the testing of RESOLVE's practicability for
specifications and designs. Work is also in progress to integrate RESOLVE into the
prototype  of  the  repository  and  to  refine  its  real-time  extensions  to  serve  the 
needs  of  the  presented  two-stage  matching  algorithm. 
Abstract

The  design  of  an  embedded  system  to  control  the  operations  of  a  large  complex  real-time 
system  requires  a  comprehensive  framework  to  meet  the  diversity  of  requirements.  Such  an 
embedded  system  must  provide  support  for  real-time  activities  in  a  fault-tolerant  and  distributed 
manner.  In  this  paper,  we  discuss  the  design  of  such  a  system,  called  MARUTI,  and  the  guiding 
philosophy  behind  it. 

MARUTI  is  a  time-driven  system,  in  which  resources  are  reserved  for  the  real-time  tasks 
prior  to  execution.  As  much  information  as  possible  is  gathered  about  resource  and  timing 
requirement  of  a  task  so  that  appropriate  temporal  resource  binding  can  be  done.  The  resource 
allocation  and  scheduling  scheme  was  developed  for  a  distributed  system  in  which  each  node 
may  be  a  multiprocessor. 

Fault  tolerance  is  achieved  through  the  development  of  resilient  applications  in  a  user- 
transparent  way  according  to  a  specified  resiliency  degree.  Active  redundancy  is  used  in  order  to 
reduce  the  recovery  latency.  The  resilient  applications  are  allocated  onto  the  distributed  system 
taking into consideration the timing and fault tolerance constraints as well as the characteristics
of the distributed environment.

The MARUTI system has been designed to assess the applicability of techniques for real-time,
distributed,  fault  tolerant  systems  in  a  cohesive  and  comprehensive  environment. 
1  Introduction


Many  large,  complex  systems  consist  of  numerous  components  which  must  interact  in  a  controlled 
manner.  An  embedded  computer  system  is  used  to  perform  the  control  functions  necessary  for 
successful  deployment  and  operation  of  such  systems.  In  order  to  meet  the  control  requirements, 
the  embedded  computer  system  must  provide  support  for  reliable  operation  of  the  software  which 
runs  on  it.  Further,  the  software  must  operate  within  real-time  constraints  to  monitor  and  generate 
control  signals. 

Many  of  the  embedded  systems  have  been  designed  using  the  event  driven  approach  in  which 
the  system  reacts  to  the  events  from  within  or  those  generated  by  the  environment.  Such  a  system 
executes the appropriate software in reaction to these events. The events are processed on the
basis of priorities assigned statically or dynamically. However, such priority based operation does
not always guarantee a timely execution. In a time driven system, all executions are carried out
during specific time intervals chosen to assure the timely execution. In this paper we present the
basic philosophy behind the design of MARUTI, a time driven system that operates in a distributed
environment  while  assuring  the  requested  fault  tolerant  behavior. 

An embedded system in operation usually executes a limited set of applications in a restricted
environment.  Not  all  such  systems  are  closed  systems  in  which  all  applications  and  their  execution 
characteristics  are  known  at  the  design  time.  Many  of  them  have  to  accept  processing  requests  made 
during  the  operation  of  the  system.  The  methodology  developed  to  design  an  embedded  real-time 
system  must  encompass  many  characteristics.  Since  embedded  systems  have  varying  requirements, 
the  design  must  be  general  enough  to  be  able  to  adapt  to  a  large  variety  of  operational  environments. 
On  the  other  hand,  it  must  be  able  to  support  the  implementation  of  control  functions  of  a  specific 
system  in  an  efficient  manner. 

When  an  embedded  system  has  to  meet  real-time  requirements  and  provide  support  for  fault- 
tolerant,  distributed,  and  heterogeneous  operation,  many  of  the  techniques  developed  for  addressing 
these  requirements  in  isolation  are  often  contradictory.  Systems  must  be  designed  taking  into 
account all these requirements and integrating their solutions throughout the system design. In
MARUTI  we  are  attempting  to  address  these  requirements  in  a  comprehensive  manner. 

The starting point in the design of MARUTI has been a careful consideration of the characteristics
of the applications to be supported on it. In Section 3 we present a brief description of
the application characteristics and the resulting system requirements. The design has followed a
consistent philosophy which is presented in Section 4. This is followed by the user's view of the
application in Section 5. In Section 6 we present the process to be used in the development and
operation of applications taking into consideration all the requirements the applications may have.
Some concluding remarks are presented in Section 7.


2  Related  Work 

In  order  to  put  our  work  in  perspective,  in  this  section  we  discuss  a  number  of  other  efforts  which  are 
aimed  at  meeting  the  requirements  of  real  time  operating  systems.  The  discussion  is  organized  along 
the lines of major requirements and how they are addressed by the ARTS[19], Alpha[5], CHAOS[15, 16],
HARTOS[9], Spring[18], MARS[7] and RT-Mach[20] systems. Research in ARTS, RT-Mach and
Spring is directed towards predictable service in distributed hard real-time environments. CHAOS
is designed to support adaptive hard real-time applications, while Alpha is designed for supporting
mission critical computing in large, complex, distributed systems with a benefit-accrual model for
real-time.  Research  in  HARTOS  is  focused  on  fault-tolerant  communication  in  a  hexagonal  mesh 
interconnection  network. 

Hard  Real-time:  Commercial  real-time  operating  systems,  such  as  Real  Time  Unix  (RTU)  and 
LynxOS,  address  the  real-time  requirements  by  providing  an  interrupt-driven  kernel,  fast  context 
switching  and  a  priority-driven  scheduler.  They  do  not,  however,  provide  hard  real-time  guarantees. 

The  Spring  kernel  and  the  MARS  system  are  time-driven  systems  that  pre-schedule  the  critical 
(hard  real-time)  tasks.  These  tasks  are  the  only  ones  guaranteed  to  execute  within  their  deadlines. 
ARTS and RT-Mach use fixed priority scheduling with static schedulability analysis to ensure that
deadlines  of  hard  real-time  tasks  will  be  satisfied  at  execution  time.  CHAOS  and  Spring  support 
dynamic  real-time  scheduling  for  online  guarantees. 

Distributed and Heterogeneous Operations: Several of the real-time systems, such as ARTS,
RT-Mach, Alpha, Spring and MARS, provide support for distributed operations but not heterogeneity.
Spring uses homogeneous multiprocessor hardware with shared memory, while ARTS operates on top
of Mach in a distributed homogeneous environment. RT-Mach is a modified version of Mach[1] and
also provides a distributed environment.

Fault tolerance: Support for fault tolerance under real-time constraints has not been extensively
researched. ARTS provides support for exception handling as well as timing errors, whereas
Alpha provides data replication as well as process migration for fault tolerance. However, they do
not address the impact of these techniques on the timing characteristics of applications. MARS
supports hardware level fault tolerance, using active replication. HARTOS provides for fault-tolerant
communication.

Clearly most of these systems address only a subset of the requirements for advanced hard
real-time operating systems. The goal of MARUTI is to develop a comprehensive framework for
addressing the hard real-time and fault tolerance requirements in a heterogeneous, distributed
computing environment.

3  Application  Characteristics  and  System  Requirements 

The design of a real-time system takes into consideration the primary characteristics of the
applications which are to be supported. Note that these characteristics are derived not just from the
real-time applications implemented today but also those anticipated [17]. The characteristics of
real-time applications which play a crucial role in the design of the operating system are identified
below.


Timing Constraints: Real-time applications have various kinds of timing constraints. In hard
real-time applications, tasks must be completed by the specified deadline for correctness.
Overrunning a deadline is considered a failure and cannot be tolerated. Although soft real-time
applications also have deadlines, overrun deadlines can be tolerated to a certain extent.
A penalty is incurred when a deadline is missed in soft real-time applications. In addition,
the real-time system may also be required to execute some jobs which do not have any
timing constraints. In many systems the hard real-time, soft real-time and non-real-time
applications must coexist.

Criticality: Many real time applications are safety critical. Examples of such systems include
nuclear power plants, life support systems, etc. A failure to perform the critical tasks
successfully can result in disastrous consequences. A real-time system must provide support for
fault tolerance and exception handling capabilities for increased reliability and tolerance to
failures, while continuing to satisfy the timing requirements.

Distributed: Real time applications envisaged today are distributed in nature and some of their
components must execute concurrently. A natural way to support this requirement is to have a
distributed system, which is a collection of processing nodes connected via an interconnection
network. Each node in the system may consist of a variety of resources including one or more
processors which may be of heterogeneous architecture. The distribution of the system may
also be required to support fault tolerant operation.

Deterministic Execution Profile: Hard real time applications require deterministic guarantees,
and thus worst case bounds on their execution times and resource requirements must be known
for appropriate resource allocation. This forbids the use of unbounded loops and recursion
in the software, which, in general, characterize non-deterministic resource consumption. Further,
to prevent unbounded waits, the synchronization of the programs should be known and
addressed explicitly.

Scenarios: Many real time systems operate in certain well defined modes, which we call scenarios.
The system is in exactly one scenario at a time but may switch scenarios when a triggering
condition becomes true. This requires the system to have the capability to switch between
different scenarios.


4  Maruti  Principles 

Recognizing  the  complexity  of  the  requirements  posed  by  the  characteristics  of  applications  we 
formulated  a  few  basic  guiding  principles  for  the  design  of  MARUTI.  In  this  section  we  present  these 
principles  and  a  brief  rationale  for  them. 

Principle  1  Time  must  be  an  essential  attribute  of  every  entity  in  the  system. 

In a time driven system, it is necessary that all parts of the system have a uniform and consistent
view of time and a way of handling it. In MARUTI, time is treated as an attribute of every entity
and all operations are carried out with respect to time. This approach not only assures the time
driven operation but also permits reasoning about the temporal properties of the system.

Principle 2 Do as much work for preparing an application for execution as early as possible.

It is difficult to predict and account for operating system overheads in a demand scheduling
model.  In  a  real  time  system  this  adversely  affects  the  required  deterministic  guarantees.  To  assure 
real-time  operation  the  operating  system  overheads  at  run  time  must  be  predictable  and  minimized 
for  efficiency  purposes.  Therefore,  during  run  time,  real  time  operating  systems  should  perform 
only  those  tasks  that  are  strictly  necessary. 

In MARUTI, the application development process goes through several phases. All the components
of an application contain placeholders for all the resource and timing information. Values
are assigned to these placeholders as early during the development phase as they can be
assessed, and are refined as the application development process proceeds.

Principle 3 All resources needed by a hard real-time application must be reserved prior to execution.

Clearly, adherence to this principle is essential to guarantee the successful execution of hard
real-time applications, for which a guarantee can only be given if all resources needed can be made
available in a timely manner.

We  consider  a  semi-dynamic  model  for  real-time  systems,  in  which  a  job  is  submitted  for 
processing  prior  to  its  release  time  (earliest  start  time).  The  time  between  the  submission  of  a  job 
and  its  release  time  can  be  used  by  the  system  for  carrying  out  the  resource  allocation  without 
affecting  the  timely  operation  of  the  application. 

In order to accept a real time job for execution, the system performs resource allocation for that
job. For hard real-time jobs, if all resources necessary for the execution are available within the
time constraints requested, the job is accepted for real-time execution. Otherwise the job request
is denied. Soft real-time jobs, however, do not need such deterministic guarantees. Non-real-time
jobs are run on a time and resource availability basis and do not require resource reservation. In this
paper, we primarily focus on hard real-time operation.

Communication  is  a  vital  part  of  most  real-time  applications  and  thus  the  required  resources  for 
communication  must  be  reserved  as  well.  A  special  case  of  communication,  namely  synchronization 
between  tasks,  is  also  achieved  via  appropriate  scheduling  and  resource  reservations. 

Principle 4 Support for fault tolerance is an integral part of the system.

Most real time applications are of a critical nature and operate in an unreliable environment. It
becomes imperative that the system provide adequate execution support despite the presence of
failures. Fault tolerance is treated as an essential aspect of MARUTI. Support for fault tolerance is
uniform, starting at the lowest level and defining error handlers and different plans of action at
each level. The fault handling can be carried out in a real-time or non-real-time manner as needed
by a particular application. These characteristics permit a systematic methodology of execution in
a fault tolerant mode, rather than in an ad-hoc manner.

5  User  View  of  Applications 

In  this  section  we  describe  the  user  view  of  the  applications,  i.e.,  a  brief  overview  of  the  programming 
paradigm  as  well  as  the  capabilities  available  to  the  user. 

An application program is a collection of cooperating software modules. The modules may
communicate with each other using shared resources or message passing. Communication through
message passing can be synchronous (i.e., remote procedure call) or asynchronous (i.e., one way
invocation).

The user specifies timing constraints for the entire program and may also specify timing
constraints for individual modules. The constraints for a module can be expressed in terms of absolute
time or can be specified relative to the timing constraints of other modules.

The  relationship  between  software  modules  may  be  specified  as  precedence,  exclusion  or  other 
constraints such as simultaneity (i.e., modules must execute simultaneously) and indivisibility (i.e.,
one  module  must  execute  after  another  with  no  loss  of  state  from  the  first  module)[22,  14]. 

Fault  tolerance  requirements  can  be  specified  by  indicating  the  resiliency  degree.  The  resiliency 
can  be  specified  on  an  individual  module  basis  or  for  the  entire  application  program.  Upon  the 
detection of a fault, user specified actions (or system defaults) are taken to handle the fault.
Mechanisms  are  also  provided  to  specify  exceptions  and  exception  handlers. 

A named collection of programs constitutes a scenario. Each scenario defines a mode of operation
of a real-time system and contains all programs necessary to operate in that mode. Scenarios
can also be seen as sudden changes in the executing programs, in response to some stimuli.
Scenarios are useful in situations where there are limited resources for execution of multiple programs,
or where programs are mutually exclusive.

In our programming paradigm we do not allow nondeterministic behavior, due to the real-time
requirements. This imposes some discipline that the programmer must observe (see Section 3).

6  Maruti  Mechanisms  and  Structures 

MARUTI has been designed using a consistent set of mechanisms and models. In this section we
present the framework used in this system, beginning with the resource model used in MARUTI.

6.1  Resource  Model 

In our model, the resources are divided into two types: active and passive. An active resource
is capable of autonomous operation, for example CPUs and DMA devices. Passive resources are
the storage devices (e.g., memory and secondary storage), which are used by the active resources.
Passive resources may be shared and are also used for communication between modules in the same
node. Note that both active and passive resources are required for internode communication.

Active resources are partitioned into disjoint groups called MEARGs (Mutually Exclusive Active
Resources Groups). Each active resource belongs to exactly one MEARG. The set of active resources
available in a distributed system may now be considered as a set of MEARGs. A task^1 executes using
the resources of one MEARG and a set of passive resources.

^1 A task is a basic executable entity. In Section 6.2.2, we show how tasks are obtained from application programs.

The schedule for each MEARG and for each passive resource is kept in a structure called a
calendar[10]. The calendar can be viewed as a sequence of non-overlapping time intervals, each of
which has a task associated with it (the task that will execute during that interval). If a resource
can only be used exclusively, an interval can have only one task associated with it.
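The calendar idea can be illustrated with a minimal sketch. The class and method names below (Interval, Calendar, reserve, task_at) are hypothetical and are not taken from MARUTI; the sketch covers only the exclusive-use case, uses integer time units, and treats interval ends as exclusive.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass(order=True)
class Interval:
    start: int                       # first time unit of the reservation
    end: int                         # exclusive end of the reservation
    task: str = field(compare=False)


class Calendar:
    """Sorted, non-overlapping reservations for one exclusively used resource."""

    def __init__(self):
        self._slots = []

    def reserve(self, start, end, task):
        """Insert [start, end) for task; return False if it would overlap."""
        if start >= end:
            raise ValueError("interval must have positive length")
        new = Interval(start, end, task)
        i = bisect_left(self._slots, new)
        # Because slots never overlap, only the two neighbours can conflict.
        if i > 0 and self._slots[i - 1].end > start:
            return False
        if i < len(self._slots) and self._slots[i].start < end:
            return False
        self._slots.insert(i, new)
        return True

    def task_at(self, t):
        """Return the task reserved at time t, or None if the slot is free."""
        for slot in self._slots:
            if slot.start <= t < slot.end:
                return slot.task
        return None
```

A rejected reservation leaves the calendar unchanged, so a caller is free to try a different interval or a different resource.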


6.2  Model  of  Computation 

To  facilitate  resource  allocation,  the  system  view  of  an  application  program  differs  from  the  user 
view.  This  distinction  comes  about  due  to  the  different  granularity  of  elements  that  compose  a 
program.  In  this  section  we  describe  the  basic  building  block  of  a  program,  how  to  create  a  program, 
how  to  add  fault  tolerance  to  it,  and  finally  how  to  allocate  resources  for  it. 


6.2.1 Elemental Units


In the system view, the building block of an application program is an elemental unit (EU). When
an EU is scheduled, we refer to it as a task. A task is then the basic entity for execution. A module
requires software^2, state, and resources to execute on a given input to produce output. All these
are encapsulated into an EU along with constraints on input and output (Figure 1).


[Figure labels: monitors, services; arrows: data and synchronization]

Figure 1: Elemental Unit

Input  conditions  are  boolean  expressions  on  the  input  data,  state,  time,  and  capabilities  which 
are  evaluated  to  trigger  services  provided  by  the  EU.  Similarly,  output  conditions  are  checks  on 
outgoing  data  and  state,  as  well  as  on  the  timing  requirements  of  the  EU.  In  addition,  input  and 
output  data  may  be  correlated  before  generating  results. 

The structure of EUs makes it possible to accomplish uniform treatment of both computation
and communication services (Figure 1a). For example, a computational EU is typically determined
by the data received and generated, by their corresponding validity checks, and by a program
that requires a state and hardware resources to operate. A communication EU is analogous, where
input and output data are the messages, input conditions may be empty, output conditions check for
message corruption, the software is comprised of the data buffering and data movement protocols,
and the hardware requirement consists of buffers, communication links (for remote communication),
and a processor to execute the software. Another aspect of computations, namely synchronization,
can also be accomplished by requiring the input condition to be satisfied only when all messages
have arrived (see Figure 2).

^2 Throughout this work we assume that the software is reentrant.


Figure 2: (a) EU communication and (b) EU synchronization

Input and output data are the pieces of information transmitted between EUs or that come
from the environment. In addition to the messages, files and other environment data (e.g., current
time or other system parameters) can also be considered as input and output data.

Input and output conditions are specified by the user and depend on the data, the state, and
the instance of the software module. Temporal constraints can also be checked as part of input and
output conditions. The evaluation of conditions can trigger executions, correlate input and output
parameters, check validity of states, etc. Note that the evaluation of conditions requires hardware
resources and the state of the EU. If an error is detected, the appropriate EU is notified, which is
depicted by the lateral arrows in Figure 1.
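As an illustration of how input and output conditions bracket a service, the following is a small sketch, not the MARUTI implementation. The ElementalUnit class, its field names, and the status strings are assumptions made for this example. The sample EU also shows synchronization: its input condition holds only once all expected messages have arrived.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ElementalUnit:
    name: str
    input_condition: Callable[[Any, dict], bool]   # gate on input data and state
    service: Callable[[Any, dict], Any]            # the encapsulated software
    output_condition: Callable[[Any, dict], bool]  # check on outgoing data
    state: dict

    def run(self, data):
        """Evaluate conditions around the service and report the outcome."""
        if not self.input_condition(data, self.state):
            return ("not_triggered", None)
        result = self.service(data, self.state)
        if not self.output_condition(result, self.state):
            return ("fault", result)   # a fault-handling EU would be notified
        return ("ok", result)


# Hypothetical synchronizing EU: it fires only when both messages are present.
EXPECTED = {"a", "b"}
sync_eu = ElementalUnit(
    name="join",
    input_condition=lambda msgs, st: EXPECTED.issubset(msgs),
    service=lambda msgs, st: msgs["a"] + msgs["b"],
    output_condition=lambda out, st: out >= 0,
    state={},
)
```

Calling sync_eu.run with only one of the two messages leaves the EU untriggered, which is the behavior the synchronization case relies on.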

6.2.2  Application  Representation 

An  application  program  is  characterized  by  a  directed  acyclic  graph  called  the  Elemental  Unit 
Graph  (EUG),  augmented  with  timing  constraints  and  operational  relations.  A  vertex  in  the  EUG 
is  an  elemental  unit  and  arcs  represent  control  flow. 
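A directed acyclic graph of this kind can be sketched with Python's standard graphlib module. The vertex names below are hypothetical placeholders for EUs, and the predecessor-map representation is an assumption of this example rather than the MARUTI data structure.

```python
from graphlib import CycleError, TopologicalSorter

# Each EU maps to the set of EUs that must precede it (control flow arcs).
eug = {
    "A1": set(),
    "RPC_B": {"A1"},
    "A2": {"RPC_B"},
    "SEND_C": {"A2"},
    "A3": {"A2"},
}


def execution_order(graph):
    """Return one order of EUs that respects every control flow arc.

    Raises graphlib.CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(graph).static_order())
```

Because static_order raises CycleError on a cyclic graph, the same call doubles as a check that a candidate EUG is in fact acyclic.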

A  task  is  a  unit  of  execution  in  our  resource  model.  A  task  executes  using  the  resources  of 
a  MEARG  and  a  set  of  passive  resources.  In  the  system  view,  tasks  communicate  using  message 
passing at the end of the task. A module which requires more than one MEARG must be decomposed
into  sub-modules  to  conform  to  the  resource  model.  Optimization  techniques  like  buffering  may  be 
used  to  reduce  overheads  introduced  by  such  decomposition. 

The timing constraint of a task is represented as a 3-tuple, (r, c, d), where r is its release time,
c  its  execution  time  and  d  its  deadline.  Time  constraints  may  be  defined  in  terms  of  absolute  times 
or  relative  to  the  time  constraints  of  other  tasks.  The  time  constraints  of  tasks  may  be  derived 
from  the  application  time  constraints. 
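The 3-tuple can be sketched as a small value type. The feasibility test r + c <= d and the latest start time d - c follow directly from the definitions of release time, execution time, and deadline. The relative_to helper is a hypothetical example of one way a constraint might be expressed relative to another task; the text does not prescribe this form.

```python
from typing import NamedTuple


class Timing(NamedTuple):
    r: int  # release time (earliest start)
    c: int  # execution time
    d: int  # deadline

    def feasible(self):
        """A task can meet its deadline only if it fits between r and d."""
        return self.c > 0 and self.r + self.c <= self.d

    def latest_start(self):
        """Latest start time that still meets the deadline."""
        return self.d - self.c


def relative_to(pred, c, window):
    """Hypothetical relative constraint: release at the predecessor's
    deadline, with a deadline `window` time units later."""
    return Timing(r=pred.d, c=c, d=pred.d + window)
```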

Operational relations are used to represent synchronization between tasks. Examples of such
relations are precedence and mutual exclusion. Another relation called indivisibility requires the
preservation of execution state. For instance, in Figure 3 an indivisibility relation may be
defined between A1 and A2, and another between A2 and A3.

Communication between EUs is represented by an EU which requires communication hardware
and other resources for execution. These communication EUs are automatically generated.
Heterogeneity is easily supported by performing data translation to a common network
representation[3].

[Figure labels: module A, RPC (B), SEND (C)]

Figure 3: (a) User software modules (b) EU structure (c) EUG

6.3  Application  Life-Cycle 

The life-cycle of an application program can be divided into development and operational phases.
The application development process goes through several phases before it is ready for operation.
In accordance with Principle 2, we try to accomplish as much as possible in each phase.

1. Development Phase: This phase is broken down into three stages, namely design,
compilation, and integration. The result of this phase is an executable application program,
represented by an EUG, which is ready to be submitted for execution.

•  Design. This stage is the starting point of the development of an application during
which the overall design is carried out. The activities during this stage include
requirements specification, conceptual design and detailed design.

•  Compilation. The software modules are created at this stage with the interface
specifications. Tools are available at this stage to decompose a software module (Figure 3a)
into sub-modules such that each sub-module can be treated as an EU (Figure 3b). The
static resource requirements for each EU are extracted at this stage. Figure 3 shows an
example of such a transformation. The input and output constraints for each EU may
be generated automatically from the interface specification. The EU structure of the
modules is used during the integration stage.

•  Integration. In the integration stage modules are interconnected to form a program.
An EUG is generated from the interconnections specified and the EU structure
generated during the compilation stage, as shown in Figure 3c. The remaining resource
requirements for the application, such as the generation of communication tasks
between EUs, are identified and recorded with the application. In order to verify that the
interconnections are valid, syntactic type checking on the interface specifications of EUs
is carried out at the time of integration.

2. Operational Phase: The operational phase consists of resource allocation and execution.
For hard real-time applications, we require that resource allocation be done prior to execution.

•  Resource allocation. When an application program is submitted for execution, the
user may specify the time constraints by identifying the release time and deadline. For
periodic applications, the period and termination condition must be provided. An
executing program is called a job. The allocation and reservation of resources to tasks in
a job is performed at this stage. Also, various kinds of synchronization constraints are
satisfied intrinsically in this stage [21].

•  Execution. During this stage the operating system performs dispatching, message
passing and reservation enforcement. Previous stages prepare the application for this
stage, such that the overheads are minimal. The dispatcher need only examine the
calendar and dispatch the tasks whose start time has arrived. In addition, we can
guarantee the absence of deadlocks, since there is no waiting.
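To make the run-time role of the dispatcher concrete, here is a minimal simulation sketch. It assumes a calendar given as a list of (start, end, task) tuples with non-negative integer start times and a discrete clock. The function name run_dispatcher and the log format are hypothetical.

```python
def run_dispatcher(calendar, horizon):
    """Walk a discrete clock and dispatch each task when its start time arrives.

    calendar: iterable of (start, end, task) reservations made before run time.
    Returns a log of (time, task) dispatch events.
    """
    pending = sorted(calendar)   # earliest start first
    log = []
    i = 0
    for now in range(horizon):
        # No scheduling decision is made here: the dispatcher only looks up
        # the precomputed calendar, which keeps run-time overhead small.
        while i < len(pending) and pending[i][0] <= now:
            log.append((now, pending[i][2]))
            i += 1
    return log
```

All allocation decisions are already encoded in the calendar, so the loop does no more than compare start times against the clock.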

6.4  Resource  Allocation 

One main problem we need to address is that of creating the calendars for MEARGs and passive
resources so that tasks are executed within their time constraints. While a task requests a single
MEARG, there may be several MEARGs capable of meeting this requirement. In general, this leads to
a multiple resource management problem which has been shown to be rather complex[6]. Grouping
resources in MEARGs reduces the complexity somewhat but does not eliminate it. In a distributed
environment the MEARGs may be at different nodes, and each node may contain several MEARGs.
To allow efficient allocation of the MEARGs and passive resources, we create calendars in three
phases: global allocation, local allocation, and local scheduling.

Global Allocators (GAs) use system-wide resource usage information to decide the node in which
a set of tasks is to execute. The Local Allocator (LA) carries out the allocation of the resources
needed to execute a task, by selecting one among the several eligible MEARGs and some passive
resources. The LA is also responsible for coordinating the scheduling of resources to various tasks
of an application so that their operational relations can be maintained. The actual interval of
execution of a task is determined by the Local Scheduler (LS), which makes the necessary entries in
the calendar of a resource (MEARG or passive resource). The start and end times of the execution
of a task i are denoted by $s_i$ and $e_i$, respectively^3. Also, there is one logical LS for each calendar.

Let us consider the interaction between local allocators and local schedulers. Consider a simple
situation where tasks $t_1$ and $t_2$ require MEARGs $m_1$ and $m_2$, respectively. Both these tasks are
submitted for scheduling in the same multiprocessor node; assume that $t_1$ sends some data to
$t_2$ through a common shared passive resource R (i.e., the shared buffers must be allocated from
the beginning of $t_1$ to the end of $t_2$). The LS for R is called LSR. The interaction between the
allocators and schedulers is described below.

1. LA sends allocation requests to the LSs of $m_1$ and $m_2$ for scheduling of $t_1$ and $t_2$.

2. LSs respond with exact start times for execution.

3. LA sends an allocation request to LSR for duration $[s_{t_1}, e_{t_2}]$.

4. LSR responds.

5. On success, LA sends commit to LSs.

Note: If the request is rejected, the LA or the GA may try allocation at some other resource.

A limitation to this approach, however, is that the LSR has no flexibility in the scheduling of
the passive resource. Furthermore, the calendars can only be modified by negotiations between
the LA, LSR and LSs. This costly negotiation and lack of flexibility, however, can be remedied by
management of the passive resources, as follows. In addition to responding with the start times in
step 2 above, the LSs also provide the forward and backward slacks associated with each task. The
forward and backward slacks of task i ($f_i$ and $b_i$) are the amounts of time by which the task can be
postponed or advanced without violating the timing constraints of tasks in a calendar. The LA sends the
scheduled times plus the slack times to the LSR. The LSR schedules the passive resource for the
interval $[s_{t_1} - \delta_1, e_{t_2} + \delta_2]$, such that $0 \le \delta_1 \le b_{t_1}$ and $0 \le \delta_2 \le f_{t_2}$.
The LSR sends $\delta_1$ and $\delta_2$ to the LA, which passes them to the LSs along with the commit
responses. This scheme allows an LS the flexibility to move tasks without the overhead needed to
consult with the LSR as subsequent tasks arrive at the LS.

^3 In this paper we consider only non-preemptive scheduling, such that $e_i = s_i + c_i$.

We have taken the approach of a tiered allocation/scheduling, combined with our resource
model, to allow for the scheduling of multiple MEARGs and passive resources for tasks. Two
advantages can be identified with this approach. One is that this scheme allows for concurrency in
scheduling of individual MEARGs. The LA is capable of initiating multiple local schedulers
concurrently without waiting for replies. The second advantage is modularity, which allows testing of
different policies at each phase.

6.5 Fault Tolerance

The  critical  nature  of  many  real-time  applications  requires  resiliency  to  faults.  An  important 
aspect of the MARUTI system is support for fault tolerance. Our main goal is, given a user-specified
resiliency  degree  and  a  program,  to  produce  a  fault  tolerant  program  that  will  be  mapped  to  the 
resources  during  the  allocation  phase.  The  resulting  job  should  tolerate  faults  to  the  level  requested 
by  the  user  during  its  execution. 

6.5.1  Fault  Model 

We  consider  an  EU  as  the  unit  of  failure  of  an  application.  In  this  model  the  input  and  output 
conditions  function  as  the  fault  detection  mechanism.  Once  a  fault  has  been  detected,  a  fault 
handling  EU  may  be  invoked.  Such  EU  may  invoke  other  EUs  to  perform  recovery  and  reporting 
(Figure  4).  The  fault  handling  policies  in  these  elemental  units  can  be  specified  by  the  user  or  be  a 
system-defined  default,  such  as  aborting  the  computation  and  sending  a  message  to  the  operator's 
console.  For  each  elemental  unit,  fault  handlers  may  have  different  criticalities  and  execute  in 
different time domains. For example, some may execute in real time (for which resource reservation
must be performed) and some others on a resource and time availability basis.


Figure  4:  (a)  EUG;  (b)  User  fault  handlers;  (c)  Default  fault  handlers 

For fault tolerance purposes, we assume that it is possible to group MEARGs into fault-independent
partitions, i.e., a fault in one partition does not cause a fault in another partition. This fault
independence assumption is valid at a hardware level, taking into account issues such as independent
power sources and different architectures for components. At the software level, independence
requires that the replicated versions be of different designs and implementations [2].

6.5.2  Approach 

Fault  tolerance  for  real-time  systems  is  not  only  complex  from  the  point  of  view  of  functional 
correctness,  but  also  due  to  the  timing  constraints  and  resource  requirements.  In  that  sense,  the 
scheme  adopted  to  tolerate  faults  in  such  systems  must  take  these  factors  into  consideration,  as 
well  as  the  timing  requirements  of  the  application  as  a  whole  and  the  user-defined  resiliency  degree. 

Due  to  the  real-time  constraints,  we  use  an  active  modular  redundancy  approach  to  construct 
a  resilient  application,  in  a  user-transparent  way.  For  each  application  submitted  to  the  system, 
the  user  specifies  the  required  resiliency  degree  rd.  We  must  ensure  that  the  number  of  replicas 
executing  the  same  application  is  enough  to  tolerate  up  to  rd  faults.  The  actual  number  of  replicas 
$rd + l$ depends on the type of faults, the detection scheme, and the system architecture. A discussion
on the different values of $l$ is beyond the scope of this paper (see [13, 11] for more details), and in
the remainder of our discussion we consider $l$ to be equal to 1.

We  do  not  use  roll-back  mechanisms  such  as  checkpointing  [4],  since  time  is  a  critical  resource 
that  should  be  accounted  for.  Roll-back  type  of  methods  are  applicable  to  systems  where  time  is 
not  a  critical  issue  [§]. 

6.5.3  Resilient  Elemental  Unit  Graphs 

In Section 6.2.2, we described how to construct elemental unit graphs. In this section we describe
how  to  transform  an  EUG  into  a  resilient  EUG  (REUG).  This  process  is  carried  out  during  the 
integration  stage  of  the  development.  The  main  idea  is  to  replicate  the  EUs  to  make  the  application 
resilient. 

The  REUG  can  be  built  with  global  or  local  redundancy  (Figure  5).  Global  redundancy  is  based 
on replication of the EUG as a whole, where the execution of each replicated EUG is treated
independently.  On  the  other  hand,  local  redundancy  replicates  specific  EUs  and  can  be  divided 
into  total  or  partial  redundancy.  Partial  redundancy  replicates  only  a  subset  of  the  EUs,  while 
total  redundancy  replicates  each  and  every  EU  in  the  EUG.  Such  flexibility  (different  types  of 
redundancy)  is  needed  because  the  EUs  in  an  application  might  have  different  probabilities  of  fault 
and  different  criticality. 
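
The three kinds of redundancy can be sketched as graph transformations. The Python below is only an illustration under stated assumptions: an EUG is modelled as a dictionary from EU name to successor names, each replica is a `(name, index)` pair, the number of copies is `rd + 1` (the case of $l = 1$), and the function names `build_global_reug` and `build_local_reug` are invented for this example. Connecting every replica of a sender to every replica of its successor is also an assumption, chosen to match the forker behaviour described later in this section.

```python
def build_global_reug(eug, rd):
    """Global redundancy: copy the whole EUG rd + 1 times, copies kept independent."""
    return {
        (eu, i): [(succ, i) for succ in succs]
        for i in range(rd + 1)
        for eu, succs in eug.items()
    }


def build_local_reug(eug, rd, critical=None):
    """Local redundancy: replicate individual EUs.

    critical=None replicates every EU (total redundancy); otherwise only
    the EUs named in `critical` are replicated (partial redundancy).
    """
    copies = {
        eu: (rd + 1 if critical is None or eu in critical else 1)
        for eu in eug
    }
    reug = {}
    for eu, succs in eug.items():
        for i in range(copies[eu]):
            # Each replica feeds every replica of each successor.
            reug[(eu, i)] = [(s, j) for s in succs for j in range(copies[s])]
    return reug


if __name__ == "__main__":
    eug = {"A": ["B"], "B": ["C"], "C": []}
    print(build_local_reug(eug, rd=1, critical={"B"}))
```

In the partial case only the EUs the user marks as critical are multiplied, so the resource overhead grows with the size of the critical subset instead of the whole graph.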

The software component of each replicated EU need not use the same algorithm design. They
may encompass algorithmic alternatives to achieve design independence. Each EU replica is placed
in a different partition, to enforce the fault independence at the hardware level. The user-defined
resiliency degree is guaranteed since there is enough redundancy to carry out the application despite
the presence of failures.

Figure 5: (a) EUG, and REUG by (b) global, (c) total, and (d) partial redundancy

In order to manage these EU replicas and the communication among them, we associate auxiliary
structures, namely forkers and joiners, with each EU in the REUG. These auxiliary structures
are  inserted  in  the  REUG  automatically  and  in  a  user-transparent  manner.  Consider  a  message  m 
from  task  A  to  task  B.  The  forker  for  A  sends  m  to  all  the  replicas  of  B.  The  joiner  of  B  collects 
the  messages  from  A  and  its  replicas,  and  filters  them  using  a  message  selection  algorithm  to  elect 
the  correct  message.  It  then  sends  the  correct  message  to  the  destination  B. 

A  consequence  of  the  use  of  joiners  is  that  some  fault  elimination  takes  place.  When  the  joiner 
selects  one  of  the  incoming  messages,  and  when  selection  results  in  the  correct  message,  all  incorrect 
messages  get  eliminated.  Note  that  the  joiners  may  also  recognize  the  errors  in  the  messages  and 
may  be  used  to  trigger  corrective  actions. 
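
A minimal sketch of a forker and a joiner follows. The text leaves the message selection algorithm to the user, so the majority vote used here is just one assumed policy, and the names `fork` and `join` are hypothetical. Messages are assumed to be hashable values so that identical copies can be counted.

```python
from collections import Counter


def fork(message, replicas):
    """Forker: deliver the same message to every replica of the destination."""
    return {replica: message for replica in replicas}


def join(messages):
    """Joiner: pick one message from the copies sent by the sender's replicas.

    Returns (selected, rejected). Selection is by most frequent value; on a
    tie the value that arrived first wins. The rejected copies are the ones
    a real joiner could report in order to trigger corrective action.
    """
    if not messages:
        raise ValueError("joiner received no messages")
    selected, _votes = Counter(messages).most_common(1)[0]
    rejected = [m for m in messages if m != selected]
    return selected, rejected
```

For example, `join([42, 42, 7])` selects `42` and reports `[7]` as rejected, which mirrors the fault elimination effect noted above: the incorrect copy goes no further than the joiner.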


6.5.4 Discussion

The approach introduced above provides fault tolerance in a uniform way, while maintaining
flexibility through user-specified resiliency degrees and message selection algorithms. Also, the creation
of an REUG and its allocation across partitions of the distributed system is transparent to the
user, unless custom fault tolerance is used to create the REUG (partial redundancy). In the partial
redundancy  case,  the  user  needs  only  to  specify  the  critical  EUs,  and  the  system  automatically 
generates  the  REUG.  In  other  words,  changing  the  fault  tolerance  requirements  does  not  affect  the 
functional  behavior  of  the  program. 

The  allocation  of  REUGs  is  carried  out  in  a  distributed  manner  [12].  The  allocation  scheme  also 
enforces  that  fault  detection  and  recovery  are  consistent  with  the  real-time  constraints,  which  makes 
the scheme suitable for real-time applications. In addition, our approach does not require special
hardware. Treatment of faults is an intrinsic characteristic of the application design, due to the
EU fault handling structures defined at the application design level. This facilitates maintenance
of the system, and allows for changing the levels of criticality of a program (both temporal and
functional) after design and integration, but before execution time.

Our model also allows users to achieve a balance between the resource overhead introduced by
processing fault-tolerant EUs and the probability of faults. In other words, the user can decide on
the cost/reliability trade-off based on the requirements of applications by specifying the resiliency
degree for parts of the EUG and the type of redundancy used. Furthermore, it is not difficult to
see that the construction of the REUGs is semantics-preserving with respect to the EUGs defined
by the users.

7  Concluding  Remarks 

In this paper we presented the basic structure of MARUTI and the philosophy used in its design.
Our experience to date confirms our belief that comprehensive solutions can only be generated
by addressing all the requirements the system must meet at system design time and developing
integrated solutions. For example, in MARUTI the fault handling as well as time handling is
carried out uniformly. The fault tolerance capabilities are available in a user transparent way, while
permitting the user to enhance the default capabilities if so desired.

The  feasibility  of  the  approach  taken  has  been  demonstrated  through  an  implementation  of 
MARUTI,  at  the  University  of  Maryland.  The  time  driven  and  distributed  nature  of  this  design  has 
been established and tested successfully. The fault handling capabilities of the design have been
validated  for  homogeneous  architectures. 

An interesting aspect of the design of MARUTI is the way it handles the distributed operations
in homogeneous as well as heterogeneous environments. In addition, due to its modular design, it
lends itself naturally as a testbed for evaluation of the resource management policies. We are in
the process of studying various policies applicable to the management of multiple resources, as well
as demonstrating the heterogeneous operation of the system.


ABSTRACT

"Case studies where written up, are rarely presented in forms
which reveal ... the full range of difficulties encountered." [1]

The  intent  of  this  paper  is  to  document  "difficulties  encountered" 
during  the  past  seven  years  of  modelling  for  analysis  of  a 
submarine  combat  system  architecture.  Lessons  learned,  along  with 
resulting  recommendations,  will  be  presented.  We  have  divided  the 
modelling  process  into  five  phases:  Data  Acquisition  and  Archive, 
Model  Design,  Model  Implementation,  Model  Exercise,  and  Results 
Analysis [2]. The paper is organized to present the lessons
learned as they apply to each of these phases. Products and
methodologies which are under development or are planned as a
result of these lessons will be listed under Products and
Methodologies.


INTRODUCTION 

The  Naval  Undersea  Warfare  Center  (NUWC)  began  using  modelling 
to  assess  the  performance  of  the  architecture  of  a  large,  complex 
submarine  combat  system  at  the  very  earliest  stage  of  the  program, 
that  is,  virtually  while  the  system  was  still  in  the  requirements 
definition  phase.  MYSTECH  ASSOCIATES,  INC.  (MYSTECH)  joined  the 
NUWC  architecture  modelling  team  late  in  1986  as  the  competing 
contractors  were  preparing  their  proposals.  The  modelling  team  has 
continued  to  monitor  the  development  of  the  system  throughout  its 
full  life-cycle:  from  the  proposal  evaluation,  through  Full  Scale 
Engineering Development (FSED), and into the test and evaluation
phase.

This  was  an  ambitious  undertaking  considering  the  magnitude  of 
the  new  system,  and  the  fact  that  the  architecture  team  was 
functionally  and  geographically  separated.  Initially,  there  were 
those who were hesitant and skeptical about the value of modelling
for architecture performance evaluation. However, as the program
developed,  and  results  were  achieved,  the  impact  of  the  modelling 
on  the  contractors'  proposed  designs  became  evident.  Those  who  had 
been  cautious  about  accepting  the  approach  began  to  recognize  the 
contribution  of  modelling  to  the  system  development.  Animation  of 
the  models  provided  visual  demonstration  which  facilitated 
management's  comprehension  of  our  effort.  The  use  of  operational 
scenarios  as  model  drivers  translated  well  to  operationally 
oriented  program  managers. 

The  utility  of  modelling  was  demonstrated  to  other  programs  as 
well.  During  1989,  NUWC  decided  to  apply  the  technology  to  the 
analysis  of  a  surface  ship  ASW  system.  MYSTECH  is  now  supporting 
NUWC on that program.

The  methodologies  used  to  develop  the  models  and  evaluate  the 
system architecture are well documented [3,4,5,6]. This paper
deals  with  some  of  the  difficulties  encountered  over  the  past  seven 
years,  and  the  lessons  learned.  The  sections  of  the  paper  coincide 
with  the  five  phases  into  which  we  have  divided  the  modelling 
process:  Data  Acquisition  and  Archive,  Model  Design,  Model 
Implementation,  Model  Exercise,  and  Results  Analysis.  Each  section 
presents a brief statement of the lesson (in bold type), a
discussion  of  the  difficulties  experienced,  and  a  list  of 
recommendations  suggested  by  each  lesson.  At  the  end  of  the 
paper,  methodologies  and  products  which  result  from  the  lessons  are 
discussed. 


DATA  ACQUISITION  AND  ARCHIVE 

A.  Data  acquisition  is  the  most  time  consuming  portion  and  has 
the  largest  learning  curve  in  the  modelling  process.  "Much  time 
and money can be consumed by what is known as 'getting started'" [7].


Given that the system under investigation was a new system
development,  and  especially  during  the  proposal  preparation  phase, 
the  majority  of  the  quantitative  information  and  data  was  unknown. 
Whatever  data  was  available  was  typically  from  "hearsay"  and  that 
information  was  frequently  changing.  Due  to  the  "fuzziness"  of 
data  and  the  constant  fluctuations  in  levels  of  information, 
everything had to be verified and validated before it could be
used. Once the contract was awarded, the data became more
substantial but it continued to change frequently. Additionally,
most of the information was still at a very high level of
abstraction, which is to be expected during the early stages of any
system development.

Early  on  in  the  program,  when  quantitative  data  was  not  yet 
available, we learned to use our Best Engineering Judgement (BEJ),
as described in [3]. This process enables the judicious
application of knowledge gained on previous, or like, systems
developments to the current system under evaluation. All the data
developed using BEJ needed to be verified. The prime directive
was: "Do not use data you do not understand" [8]. The search for
information  to  verify  these  assumptions  was  very  time  consuming 
since  there  was  no  central  repository. 

General  combat  systems  knowledge,  as  well  as  specific  data 
concerning  elements  within  the  proposed  competing  architectures, 
was  gathered  from  as  many  areas  as  possible.  Sources  included 
engineers  and  documentation  from  other  related  NUWC  programs. 
Acoustic,  database  management,  and  simulation  experts  provided  a 
rich  source  of  information.  Product  literature  on  operating 
systems  and  database  management  systems,  specifically  those  under 
consideration  in  the  proposals,  provided  baseline  data  for  the 
simulations.  As  much  information  as  possible  was  gathered  from 
whatever  documentation  was  received  from  the  competing  contractors, 
and  subsequently  from  the  prime  development  contractor  after  the 
award.

Recommendations:

1.  Use  of  "best  engineering  judgement"  to  fill  in  the  holes 
is  a  useful  and  productive  process  when  applied  with  extensive 
record  keeping  and  verification.  This  allows  the  project  to  move 
forward  while  the  painstaking  process  of  data  gathering  is  going 
on. 


2.  An  extensive  system-specific  document  library  was 
developed  which  contained  all  documentation  received  during 
competition  and  then  during  FSED.  As  each  document  or  memorandum 
was received, it was catalogued and stored, and a notice was
distributed to all team members.

3. The ability to reuse documents/knowledge/data from one
program to the next saves valuable time and effort. When talking
about the reuse of problem solving information, Rumbut emphasizes
the value of "reusing specification data from previous development
efforts" [9]. Knowledge gained from the submarine combat system
architecture  evaluation  provided  valuable  support,  and  shortened 
the  start-up  time,  when  the  modelling  process  was  initiated  on  the 
surface  ship  ASW  system.  Specifically,  knowledge  of  the 
communication  network,  methodology  for  designing  the  model,  and  the 
actual  skeleton  of  the  system  model  were  all  re-used. 

4. Data acquisition must be started as early in the
modelling  process  as  possible.  As  much  documentation  as  possible 
was  gathered  as  quickly  as  possible  to  begin  development  of  a 
surface  ship  ASW  system  document  library. 

B.  Much  time  can  be  consumed  waiting  for  all  the  data  for  the 
development of a complete and accurate model.

In  preparation  for  evaluation  of  the  proposed  submarine  combat 
system  designs,  several  NUWC  engineers  were  tasked  to  define  a 
notional  architecture  to  model  nearly  ten  months  before  the  time 
that  the  NUWC/MYSTECH  team  was  assembled.  As  mentioned,  there  was 
very  little  data  available,  especially  at  that  point  in  the 
program.  The  team  finally  did  build  a  skeleton  of  a  system  model, 
nearly  a  full  year  after  the  original  tasking.  Nine  months  later, 
a  memo  documenting  model  results  was  written  containing  the 
statement  "Results  are  considered  preliminary  because  of  the  many 
assumptions  and  'best  engineering  judgements'  incorporated  in  the 
model" [10]. Two years into the project, 50% of the messages in the
Interface  Requirements  Specification  (IRS)  were  still  incomplete. 

At  times,  the  system  model  contained  varying  levels  of  detail 
as  system  component  data  became  available  at  different  times.  For 
instance,  there  was  detailed  information  about  the  communication 
network  protocols,  but  very  little  was  known  about  the  workstations 
and data manager. Jenevein notes: "The model must deal well with
different levels of design abstraction ... components which are
not to be immediately implemented can be left in abstract form and
interfaced to the more resolved component models" [11].

Recommendation:


1.  Models  were  developed  in  a  top  down  fashion,  starting 
with  simple,  high  level  models  and  using  BEJ  where  needed.  All 
assumptions  and  engineering  judgements  were  documented  and  included 
with  model  results.  As  data  became  available,  corrections  and 
amplifications  were  made  and  the  models  became  more  detailed.  In 
this  manner,  we  were  able  to  evaluate  the  proposed  system 
architectures  with  a  degree  of  confidence,  especially  during  the 
proposal  preparation  phase,  despite  the  lack  of  information. 

C. Careful records of assumptions and data must be maintained for
validation and verification (V&V). Without traceability, chaos can
result.

The  assumptions  and  best  engineering  judgements  used  to 
develop  our  models  were  replaced  with  actual  data  as  it  became 
available.  As  the  system  design  evolved,  data  such  as  system 
topologies,  bus  rates,  message  rates  and  lengths,  maximum  message 
and  packet  lengths,  and  communications  network  queue  priorities, 
changed  frequently.  Team  leaders  cautioned:  "Don't  throw  anything 
away - things are constantly changing" [8].

Two  models  were  developed  simultaneously,  one  for  each  of  the 
two  competing  architecture  designs.  Thus,  it  became  imperative 
that  information  for  the  two  models,  sometimes  very  similar,  be 
kept  separate  and  distinct.  Our  goal  was  to  evaluate  the 
architecture  of  the  proposed  systems.  Model  results  were  one  of 
the  means  to  achieving  this  goal.  For  results  to  be  credible,  we 
had  to  be  able  to  justify  the  data  used  to  produce  them. 

The  models  shifted  rapidly  from  proposal  evaluation  models  to 
FSED  models.  Some  of  the  models  were  over  four  years  old  by  that 
time.  We  needed  to  be  able  to  retrieve  and  justify  any  data  which 
was  used  in  the  original  models  or  model  designs.  By  the  time  FSED 
models  were  being  developed,  enough  information  had  been  archived, 
and  enough  models  had  been  built,  to  enable  a  structured  V&V 
process.

Our  model  results  were  compared  with  the  results  of  the 
development  contractor's  own  modelling  efforts,  so  all  assumptions, 
engineering  judgements,  and  data  used  to  develop  the  models  had  to 
be documented for identification and verification. Comparison of
results from the two parallel efforts provided additional V&V, and
encouraged  dialogue  between  the  two  organizations. 

A  database  was  maintained  containing  thousands  of  messages 
used  to  drive  the  models.  We  created  message  groupings  by 
combining  messages  by  source  and  destination,  and  then  by  rate,  and 
executed  the  models  with  these  groupings.  Despite  the  careful 
recording  of  which  messages  were  used  in  which  model  and  for  which 
run,  there  was  a  problem  tracing  the  origination  of  some  messages 
and  their  rates  and  sizes.  We  actually  had  an  internal  lessons 
learned  meeting  at  the  time  to  review  what  had  gone  wrong.  The 
finding  was  that  there  still  was  not  enough  record  keeping  and  data 
archival.
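
The grouping step, together with the traceability that turned out to be missing, can be pictured with a small Python sketch. It is only an illustration: the field names (`name`, `source`, `destination`, `rate_hz`, `origin`) and the functions `group_messages` and `untraceable` are assumptions made for this example, not a description of the database the team actually used.

```python
from collections import defaultdict


def group_messages(messages):
    """Group message names by (source, destination), then by rate."""
    groups = defaultdict(lambda: defaultdict(list))
    for msg in messages:
        link = (msg["source"], msg["destination"])
        groups[link][msg["rate_hz"]].append(msg["name"])
    # Convert to plain dicts so the result prints and compares cleanly.
    return {link: dict(by_rate) for link, by_rate in groups.items()}


def untraceable(messages):
    """Return names of messages that carry no record of where they came from."""
    return [m["name"] for m in messages if not m.get("origin")]


if __name__ == "__main__":
    msgs = [
        {"name": "M1", "source": "A", "destination": "B", "rate_hz": 4, "origin": "IRS"},
        {"name": "M2", "source": "A", "destination": "B", "rate_hz": 4, "origin": ""},
        {"name": "M3", "source": "A", "destination": "C", "rate_hz": 1, "origin": "BEJ"},
    ]
    print(group_messages(msgs))
    print(untraceable(msgs))
```

Keeping an origin field on every record, and checking it before a run, is one inexpensive way to catch the kind of gap the internal review identified.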

Recommendations:

1.  The  importance  of  a  catalogued  document  library  was 
emphasized.  This  facilitated  the  retrieval  of  data  sources,  and 
provided  traceability. 

2.  An  extensive  database  of  the  messages,  operational  events 
from  the  scenario,  system  software  modules,  and  processor 
characteristics  was  developed  and  maintained. 

3. An engineering notebook was developed and maintained for
each  model.  Meeting  notes,  memos,  model  designs,  assumptions, 
BEJs,  questions,  issues,  results  of  all  runs,  and  analysis  of  the 
results  were  recorded. 

4.  Naming  conventions  were  established  for  modules, 
messages,  and  hardware  devices  for  use  within  the  models. 

5.  A  Standard  Template  for  Software  Development  specified 
preamble standards for documentation of each module [12]. Items
included  in  this  template  were: 

a.  routine  name  and  purpose 

b.  author 

c.  release  date 

d.  modification  dates 

e.  inputs  and  outputs 

f.  global  and  local  program  variables 

g. subroutines that call this routine

h. subroutines called by this routine

i.  change  list,  including  date  and  description 

j.  special  testing  considerations. 

6. Several standards were developed to "ensure that all
models produced ... form a consistent set of tools" [13]. These
standards included module execution time estimates [14], message
rate groupings, channel characterizations, and network delay and
overhead rate calculations.

7.  The  models  were  designed  to  make  extensive  use  of  input 
files,  which  made  the  data  easy  to  identify  and  retrieve.  Input 
files  were  developed  for  software  processes,  hardware  devices, 
messages,  system  topologies,  and  scenarios. 
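
To make the input-file idea concrete, here is a small Python sketch of a message file being read into typed records. The comma-separated layout, the column names, the sample rows, and the function name `load_messages` are all invented for this illustration; the text does not describe the actual file formats.

```python
import csv
import io


def load_messages(text):
    """Parse a message input file into a list of records with typed fields."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append({
            "name": row["name"],
            "source": row["source"],
            "destination": row["destination"],
            "rate_hz": float(row["rate_hz"]),
            "length_bytes": int(row["length_bytes"]),
            "origin": row["origin"],  # where the value came from (document or BEJ)
        })
    return records


# Made-up sample content, for illustration only.
SAMPLE = """name,source,destination,rate_hz,length_bytes,origin
TRACK_UPDATE,SONAR,DATA_MGR,4,256,IRS
STATUS,WORKSTATION_1,DATA_MGR,1,64,BEJ
"""

if __name__ == "__main__":
    for record in load_messages(SAMPLE):
        print(record)
```

Because the data lives outside the model, a changed rate or length is a one-line edit to the file, and the file itself becomes part of the archived record for a run.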


MODEL  DESIGN 

A.  First,  know  and  understand  the  model  goals.  That  is,  what 
information  are  you  trying  to  get  from  the  model? 

The  architecture  lead  engineer  admonished  us  to  "wear  a  system 
engineering  hat"  and  to  remember  what  our  requirements  were,  that 
the  "goal  was  to  evaluate  the  architecture,  not  develop  a  model" 
[8]. This was especially important because we were evaluating a
system  throughout  its  development  process,  and  the  overall  program 
issues  changed  as  the  program  matured. 

We  got  immersed  in  a  modelling  problem  in  which  the  number  of 
modules  (M)  increased  tremendously  based  on  the  number  of  nodes  (N) 
in  the  topology  (M=N(3+2N)).  Considerable  manhours  were  spent 
looking  for  possible  solutions.  Finally,  we  asked  ourselves  "what 
are  we  trying  to  accomplish  in  this  modelling  effort?  what 
questions do we want to answer?" [8].
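
A quick calculation shows why this relationship became a problem. The short Python sketch below simply evaluates M = N(3 + 2N) for a few node counts; the function name `module_count` and the chosen values of N are ours.

```python
def module_count(n_nodes: int) -> int:
    """Number of modules M for a topology of N nodes, using M = N(3 + 2N)."""
    return n_nodes * (3 + 2 * n_nodes)


if __name__ == "__main__":
    for n in (5, 10, 20, 40):
        print(f"N = {n:2d}  ->  M = {module_count(n)}")
    # N =  5  ->  M = 65
    # N = 10  ->  M = 230
    # N = 20  ->  M = 860
    # N = 40  ->  M = 3320
```

Because the dominant term is 2N², each doubling of the node count roughly quadruples the number of modules to build and maintain.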

Recommendations:

1. A set of metrics for system architecture performance
evaluation was defined. These came to be known as "the big eight"
and  were  incorporated  into  the  goals  for  every  model: 

a.  resource  utilization 

b. system throughput

c. message response time

d.  data  latency 

e.  data  senescence 

f.  system  expandability 

g.  system  reconfigurability 

h.  identification  of  system  'choke  points'. 

These  measures  were  useful  in  evaluating  system  behavior  under 
various  reconfiguration  conditions,  and  to  study  failure  modes  and 
possible  recovery  actions.  As  the  evolution  of  the  system  brought 
about  different  critical  system  issues,  these  eight  metrics 
remained  the  overall  modelling  objectives. 

2.  To  prevent  the  models  from  "taking  on  a  life  of  their 
own",  thus  becoming  unresponsive  to  program  objectives,  a 
structured  review  cycle  was  developed  to  be  followed  for  each  of 
the  models.  This  consisted  of  several  structured  steps  which  all 
members of the modelling team were expected to follow:

a.  initial  strategy  meeting 

b.  requirements/goals  review 

c.  high  level  design  review 

d.  detailed  design  reviews 

e.  model  implementation 

f.  results  review 

g.  conclusions  and  recommendations  report. 

Throughout  this  cycle,  the  models'  goals  were  kept  visible  to 
ensure  that  the  models  were  attaining  them. 

B. A model must be tailored to a specific area of interest: too
broad  is  as  undesirable  as  too  much  detail. 

The  initial  plan  for  the  system  model  was  to  include  every 
element  of  the  system  architecture.  This  would  have  produced  a 
model  which  was  too  big,  too  complex,  and  virtually  unmanageable. 
"We  had  a  concern  that  our  model  was  too  complex  but  that  a  simpler 
model  would  reduce  the  sensitivity."  [8]. 

The  modelling  team  was  directed  to  develop  a  smaller,  detailed 
model  for  each  individual  subsystem  within  the  combat  system;  at 
the  same  time  still  develop  a  larger  system  model.  It  was  not 
immediately clear how to separate the detailed models from the
system model. Even though a system topology diagram was used to
determine where the dividing lines should be, there was still much
confusion. It was difficult to determine how to partition the
system elements. It also became unclear as to what the "system
model" actually was [15].

Recommendations:

1. Looking at the system from both directions at once (i.e.,
from the top down and from the bottom up) gave valuable insight
into the important details. This produced a reliable set of
partitions. For example:

a)  Implementation  of  the  partitioning  guidelines 
defined  in  [3]  provided  some  structure  which  eliminated 
the  confusion.  Using  a  top  down  approach,  a  high  level 
system  model  was  defined  with  black  boxes  for  the 
subsystems.  In  addition,  a  Register  Insertion  Ring 
Network  (RIRN)  model  was  developed  using  black  boxes  for 
each  node.  Later,  a  detailed  node  model  was  developed 
consisting  of  three  segments,  one  for  each  processor 
within  the  node. 

b)  Reversing  the  process,  a  bottom  up  approach  was  also 
used.  For  each  competing  contractor's  architecture, 
detailed  models  were  developed  of  the  data  manager, 
system network, system executive (operating system),
signal  processing,  and  workstation.  These  detailed 
models  were  then  abstracted  and  represented  as  black 
boxes  in  the  system  model. 

2.  Definition  of  interfaces  between  models  is  critical. 
Clearly  defined  interfaces  between  separate  models  must  be  ensured 
if  they  are  to  be  connected  to  a  system  model  at  some  point.  One 
important  guideline  that  was  imposed  was  "not  to  hook  things  that 
are  not  really  elements  of  the  real  system  so  they  don't  corrupt 
the data flow" [8]. Statistical distributions were used to
represent  each  detailed  model  and  that  element's  impact  on  the 
system.  Use  of  a  common  set  of  scenarios  to  drive  all  the  models 
provides  a  consistent  foundation  between  the  detailed  models  and 
the  system  model. 
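
One way to picture a detailed model being represented by a statistical distribution is sketched below in Python. The class name `BlackBox`, the use of resampling from delays observed in detailed-model runs, and the helper `end_to_end` are assumptions for illustration only; the text does not say which distributions were fitted or how.

```python
import random


class BlackBox:
    """Stands in for a detailed model by replaying delays it was observed to produce."""

    def __init__(self, observed_delays, seed=0):
        if not observed_delays:
            raise ValueError("need at least one observed delay")
        self._delays = list(observed_delays)
        self._rng = random.Random(seed)  # seeded so runs are repeatable

    def delay(self):
        """Draw one delay from the empirical distribution."""
        return self._rng.choice(self._delays)


def end_to_end(boxes):
    """Delay of one message passing through each black box in turn."""
    return sum(box.delay() for box in boxes)


if __name__ == "__main__":
    network = BlackBox([0.8, 1.0, 1.3])       # delays taken from detailed network runs
    data_manager = BlackBox([2.0, 2.5, 6.0])  # delays taken from detailed data manager runs
    print(end_to_end([network, data_manager]))
```

The interface between the two levels is then just the delay value, which keeps the system model from depending on the internals of any detailed model.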

C. Ensure that the model is easy to modify and maintain.

The assumptions used in the models changed constantly and
therefore had to be easily visible and identifiable. During the
very  early  stages  of  the  program,  not  only  did  topologies  change 
frequently,  but  the  actual  hardware  elements  of  the  system  changed 
as  proposed  designs  evolved.  In  fact,  the  specific  database 
management  system  and  operating  system  were  not  chosen  until  well 
into  the  FSED  phase  of  the  program. 

Since  the  models  were  used  to  perform  parametric  studies, 
parameters  to  be  varied  had  to  be  easily  accessed.  During  proposal 
evaluation,  rapid  prototyping  was  performed  to  support  NUWC's 
requests  for  certain  types  of  simulation  runs  where  results  were 
needed  within  24  hours. 

During  that  time,  the  simulation  language  of  choice  was  CACI's 
NETWORK  II.5®  [16],  which  was  very  easy  to  use.  However,  one  of
its  limitations  was  that  every  hardware  device  in  the  model  had  to 
have  its  own  separate  hardware  device  definition  and  software 
module.  Since  the  system  model  had  dozens  of  nodes  which  were  all 
physically  and  functionally  identical,  a  single  hardware 
definition/software  module  pair  should  have  been  sufficient  to 
describe  all  the  nodes.  Instead  we  had  to  define  a  separate  pair 
for  each  node,  with  the  only  difference  being  their  names.  Every 
time  a  node  changed,  which  happened  frequently,  numerous  identical 
modifications  had  to  be  made.  Typographical  errors  were  prevalent 
due  to  the  similarities  within  the  names. 
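
The duplication problem described above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not NETWORK II.5 or SIMSCRIPT II.5 code; the names NodeTemplate and build_nodes, the field names, and all values are hypothetical. It shows one shared definition being instantiated for many identical nodes, so that a change is made once rather than once per node.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class NodeTemplate:
    """One shared definition for physically identical nodes (hypothetical fields)."""
    processor_mips: float
    bus_mbytes_per_sec: float
    software_module: str


def build_nodes(template: NodeTemplate, count: int) -> dict[str, NodeTemplate]:
    """Create `count` named nodes that all share a single definition."""
    return {f"NODE_{i:02d}": template for i in range(1, count + 1)}


# Illustrative values only.
base = NodeTemplate(processor_mips=2.0, bus_mbytes_per_sec=10.0,
                    software_module="node_executive")
nodes = build_nodes(base, 24)

# A single edit to the template changes every node at once.
upgraded = build_nodes(replace(base, processor_mips=4.0), 24)
```

Because node names are generated rather than typed by hand, the class of typographical error noted above does not arise in this arrangement.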

Recommendations:

1.  The  simulation  language  of  choice  changed  from  NETWORK
II.5  to  CACI's  SIMSCRIPT  II.5®  [17],  which  supported  greater
model  complexity  while  remaining  easy  to  modify;  both  benefits
offset  the  cost  of  the  greater  model  development  effort.


®NETWORK  II.5  is  a  registered  trademark  and  service  mark  of  CACI,  Inc.  -  Federal.

®SIMSCRIPT  II.5  is  a  registered  trademark  and  service  mark  of  CACI,  Inc.  -  Federal.

2.  Extensive  use  of  input  files  was  made  to  contain  much  of
the  frequently  changing  data.  There  were  files  to  represent 
several  different  model  parameters: 

a.  message  files  contained  sources,  destinations, 
rates  and  lengths 

b.  hardware  files  contained  processor  speeds  and  bus 
speeds 

c.  software  files  contained  processing  times  and  SLOC 
sizes 

d.  operational  scenario  files  contained  operator 
actions  and  tactical  event  descriptions 

e.  topology  files  contained  architecture  topology 
descriptions 

f.  application  to  device  files  contained  detailed 
software  to  hardware  mappings. 
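
As a rough illustration of keeping frequently changing data out of the model code, the sketch below reads a message file into records. The CSV layout, the column names, and the sample values are assumptions made for this example; the original file formats are not described here.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Message:
    source: str
    destination: str
    rate_hz: float       # messages per second
    length_bytes: int


def load_messages(stream):
    """Read message definitions from a CSV stream with a header row."""
    reader = csv.DictReader(stream)
    return [
        Message(row["source"], row["destination"],
                float(row["rate_hz"]), int(row["length_bytes"]))
        for row in reader
    ]


# In practice the stream would be an opened file; a string is used here
# so the sketch runs on its own. All names and values are made up.
SAMPLE_MESSAGE_FILE = """source,destination,rate_hz,length_bytes
SONAR_A,DATA_MGR,10,512
DATA_MGR,WORKSTATION_1,2,2048
"""

messages = load_messages(io.StringIO(SAMPLE_MESSAGE_FILE))

# Offered load in bytes per second, computed from the file contents alone.
offered_load = sum(m.rate_hz * m.length_bytes for m in messages)
```

Changing a rate or a destination then means editing one line of the input file, with no change to the model itself.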

D.  Model  design  must  provide  for  use  of  parametric  loading  as
well  as  realistic  operational  scenarios. 

The  system  model  was  used  to  perform  parametric  studies  with 
varying  message  rates,  lengths  and  iteration  periods.  These 
studies  were  used  to  measure  response  times  over  different  network 
paths.  Messages  were  generated  periodically,  either  with  some 
statistical  distribution  or  actual  system  intervals.  We  called 
these  simulations  "time  driven".  In  some  of  these  studies,  the 
model  was  also  loaded  with  background  "noise"  in  which  statistical 
distributions  were  used  to  simulate  network  traffic  and  device 
processing  delays.  Specific  test  messages  were  then  injected  on 
top  of  this  "noise"  to  measure  critical  paths  and  actual  message 
delays.
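
A "time driven" load of this kind can be sketched as follows. This is a simplified illustration in Python, not the original model: the exponential distribution for the background traffic, the function names, and the numeric values are all assumptions.

```python
import random


def periodic_times(period, duration, start=0.0):
    """Generation times for a test message sent at a fixed interval."""
    times = []
    t = start
    while t < duration:
        times.append(t)
        t += period
    return times


def noise_times(mean_gap, duration, rng):
    """Background traffic with exponentially distributed gaps (an assumption)."""
    times = []
    t = rng.expovariate(1.0 / mean_gap)
    while t < duration:
        times.append(t)
        t += rng.expovariate(1.0 / mean_gap)
    return times


rng = random.Random(1992)  # fixed seed so a run can be repeated
test_msgs = periodic_times(period=0.5, duration=10.0)
background = noise_times(mean_gap=0.1, duration=10.0, rng=rng)

# Merge into one time-ordered schedule; the tag identifies the test messages
# whose delays would be measured on top of the background load.
schedule = sorted([(t, "test") for t in test_msgs] +
                  [(t, "noise") for t in background])
```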

Realistic  operational  scenarios  were  constructed  to  provide 
insight  into  the  behavior  of  the  system  architecture  under 
different  levels  of  operational  stress.  The  system  model  was 
driven  by  several  stress  scripts  in  which  actual  system  messages 
were  transferred  over  the  network  and  software  processing  was 
incurred  as  realistically  as  possible.  The  events  simulated 
included  aperiodic  operator  actions  such  as  console  button  pushes, 
which  incur  a  specific  system  response.  Most  of  the  operator 
actions  also  caused  a  chain  of  events  to  occur  in  which  messages 
were  generated  asynchronously.  We  called  these  simulations  "event 
driven".  In  addition  to  the  scenario  events,  some  of  these 
simulations  also  modelled  the  overhead  incurred  by  such  background
processing  as  database  management,  PM/FL,  the  operating  system  and 
resource  management. 

Recommendations:

1.  Construction  of  input  files  to  define  the  varying 
parameters  for  the  parametric  studies  is  useful.  In  this  way  the 
code  will  not  require  modification  for  each  study. 

2.  The  messages  to  be  used  for  time  driven  simulations 
should  be  contained  in  input  files.  Scenario  inputs  for  event 
driven  simulations  should  be  contained  in  input  files  as  well. 
These  should  include  not  only  the  initiating  operator  actions,  but 
also  the  resulting  chains  of  events  (i.e.,  processing  times  and
messages).

3.  Use  operational  scenarios  to  exercise  the  models.  This 
provides  not  only  a  common  source  of  data  to  be  used  by  all  the 
models  and  model  developers  but  further  reinforces  the  model's 
ability  to  accurately  evaluate  the  system  for  its  intended  purpose. 
In  addition,  because  these  scenarios  are  developed  using  combat 
system  terminology,  models  can  be  more  readily  explained  to  other 
system  engineers  involved  in  the  program  and  more  readily 
appreciated  by  management.  It  is  easier  to  describe  the 
architecture's  behavior  in  the  context  of  something  such  as  system 
IPL  rather  than  "getting  a  message  from  A  to  B  via  nodes  X,  Y,  Z". 

4.  Models  should  be  designed  to  support  event  driven
simulations  and  their  asynchronous  operational  events  and  messages. 
This  provides  insight  into  the  behavior  of  the  system  under  various 
realistic  stress  conditions. 
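
The idea of an "event driven" run, in which an operator action sets off a chain of asynchronous events, can be sketched with a simple time-ordered event queue. The event names, delays, and the CHAINS table below are hypothetical, and the sketch deliberately ignores resource contention and queueing, which a real architecture model would have to include.

```python
import heapq

# Hypothetical chains: each event lists the (delay, follow-on event) pairs
# it triggers. Delays are in seconds and are made up.
CHAINS = {
    "button_push": [(0.002, "request_msg")],
    "request_msg": [(0.010, "db_lookup")],
    "db_lookup": [(0.005, "display_update")],
    "display_update": [],
}


def run(initial_events):
    """Process events in time order and return the (time, name) trace."""
    queue = list(initial_events)
    heapq.heapify(queue)
    trace = []
    while queue:
        now, name = heapq.heappop(queue)
        trace.append((now, name))
        for delay, follow_on in CHAINS[name]:
            heapq.heappush(queue, (now + delay, follow_on))
    return trace


# Two aperiodic operator actions, at 1.0 s and 1.004 s.
trace = run([(1.0, "button_push"), (1.004, "button_push")])

# Response time of the second action: from button push to display update.
response_time = max(t for t, n in trace if n == "display_update") - 1.004
```

In a fuller model the chains would come from the scenario input files rather than being written into the code.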


MODEL  IMPLEMENTATION

A.  May  need  to  use  different  modelling  techniques  and  tools 
depending  upon  the  elements  to  be  modelled. 

Early  in  the  program,  before  either  competitor  had  settled  on
an  acoustic  front  end  architecture,  NUWC  chose  to  model  a 
representative  cabinet  associated  with  a  deployed  sonar  system  to 
explore  the  dynamics  of  that  data  processing.  This  was  a  good 
first  model  since  there  was  existing  data  and  results  could  be
reasonably  well  predicted.  The  model  results  agreed  with
spreadsheet  calculations,  confirming  the  belief  that  this  processing  is
pipelined  and  did  not  require  dynamic  modelling.

One  model  of  the  entire  combat  system  architecture  would  have 
been  too  large  if  all  the  details  were  included.  Even  with  careful 
application  of  methods  defined  in  [3],  the  system  model  became
increasingly  large  and  complex  for  NETWORK  II.5.  It  was  difficult
to  modify  and  replicate  modules,  and  simulations  took  too  long.

Recommendations:

1.  Tool  selection  should  be  tailored  to  the  characteristics
of  the  elements  to  be  modelled  and  the  model  goals.  Dynamic  models
are  often  not  necessary,  nor  are  they  always  productive,  when  a
static  model  of  the  element  can  be  used.  Indeed,  early  in  the
process  spreadsheet  type  models  might  be  preferable.

2.  Statistical  distributions  can  be  calculated,  using  a
package  such  as  UNIFIT*  [18],  to  represent  the  detailed  models  of
system  components  in  the  system  model.

3.  Be  prepared  to  change  simulation  languages  as  models 
mature,  or  goals  change.  For  example,  SIMSCRIPT  II.5  replaced
NETWORK  II.5  as  the  models  became  increasingly  complex.
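
Representing a detailed model by a statistical distribution can be sketched as below. This does not use UNIFIT; it assumes, for illustration only, that the delays are exponentially distributed and fits the rate from the sample mean. A distribution fitting package would instead choose among several candidate families. The class name BlackBox and the sample delays are hypothetical.

```python
import random
import statistics


def fit_exponential(samples):
    """Return the rate of an exponential distribution matching the sample mean."""
    return 1.0 / statistics.fmean(samples)


class BlackBox:
    """Stands in for a detailed model inside the system model."""

    def __init__(self, rate, seed=0):
        self.rate = rate
        self._rng = random.Random(seed)

    def delay(self):
        """Draw one service delay from the fitted distribution."""
        return self._rng.expovariate(self.rate)


# Made-up service times (seconds), as if logged from a detailed
# data manager model.
detailed_delays = [0.012, 0.009, 0.015, 0.011, 0.013]

data_manager = BlackBox(fit_exponential(detailed_delays))
sampled = [data_manager.delay() for _ in range(10_000)]
```

The system model then calls delay() wherever the detailed model would otherwise have to be executed.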

B.  Be  able  to  support  portability  of  models  across  various 
platforms. 

The  NUWC/MYSTECH  modelling  team  was  geographically  dispersed 
at  various  sites  in  Rhode  Island  and  Connecticut.  Models  were 
developed  and  run  at  each  of  these  sites,  all  on  different
machines.

Some  of  the  simulations,  being  run  on  a  VAX  11/780  and  VAX 
11/785,  took  several  days  to  complete.  In  an  effort  to  reduce 
execution  time  and  increase  performance,  the  models  were  ported  to 
SUN  workstations  which  produced  faster  run  times.  Eventually,  two 
of  the  models  were  ported  to  a  CRAY  where  run  times  were  reduced  to
hours  instead  of  days.

*UNIFIT  is  a  registered  trademark  of  Select  Software  Services. 

Animation  was  used  as  a  debugging  aid,  to  enhance  our
understanding  of  how  the  architecture  behaved  and  to  support
demonstrations.  To  meet  that  need,  the  models  were  ported  to  Sun
workstations.

Recommendations:

1.  The  benefits  of  portability  became  apparent  very  quickly 
as  models  were  developed  on  different  machines  at  different  sites. 
The  ability  to  use  the  SUNs  and  a  CRAY  to  shorten  simulation  run 
times  was  crucial  for  providing  rapid  turnaround  of  model  results. 
The  use  of  SUNs,  and  PCs  if  possible,  to  support  animation  provides 
a  valuable  way  to  instruct  model  users  and  program  managers  who  are 
otherwise  unfamiliar  with  the  models  or  modelling  tool. 
Furthermore,  no  programming  changes  should  be  required  to  move  from 
one  platform  to  another.  Otherwise,  there  will  be  additional 
configuration  management  impacts.  The  simulation  language  must  be 
chosen  carefully  at  the  outset  to  ensure  ease  of  portability  later. 


MODEL  EXERCISE 

A.  Must  have  good  configuration  management  in  order  to  replicate 
runs.


We  performed  numerous  runs  with  each  model  due  to  erroneous  or 
changing  data.  The  models  were  used  to  perform  parametric  studies 
and  rapid  prototyping.  They  were  executed  under  a  variety  of 
different  operational  scenarios.  Despite  the  engineering 
notebooks,  naming  conventions,  input  files  and  modelling  standards, 
it  was  still  a  difficult  process  to  keep  track  of  each  run:  its 
purpose,  unique  parameters,  run  time  and  duration. 

Recommendations:

1.  For  parametric  studies,  SIMSCRIPT  II.5's  reset  statement
was  used  to  perform  multiple  runs  with  one  batch  job.  This 
statement  reinitializes  all  statistical  counters  relative  to  the 
listed  variables.  The  entire  parametric  study  could  be  performed 
with  one  simulation  by  resetting  the  selected  parameter  and  all  the 
statistics  for  each  run.  Selected  simulation  tools  should  have 
this  capability. 

2.  A  directory  tree  structure  was  developed  for  organizing
each  model,  its  runs,  and  its  unique  input  files.  All  common  input 
files  were  stored  in  one  directory.  Full  path  names  were  specified 
within  the  models. 

3.  Output  files  were  designed  such  that  all  initialization 
parameters  for  a  given  simulation,  including  all  input  file  names, 
were  echoed  to  the  output  files  at  the  beginning  of  the  simulation. 
This  design  was  incorporated  into  the  surface  ship  ASW  system 
models  as  well.
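
The first and third recommendations above can be combined in a short sketch: one batch job loops over the parameter values, starts each run with fresh statistics, and echoes every initialization parameter and input file name to the output. This is Python for illustration, not SIMSCRIPT II.5; the simulate function is a toy stand-in for a model run, and the file paths are hypothetical.

```python
import io
import random
import statistics


def simulate(message_rate, n_messages, rng):
    """Toy stand-in for one model run: returns simulated response times."""
    return [rng.expovariate(1.0) / message_rate for _ in range(n_messages)]


def parametric_study(rates, input_files, out, seed=1):
    """Run one simulation per rate, echoing the run setup to `out`."""
    results = {}
    for rate in rates:
        # Echo every initialization parameter so the run can be replicated.
        out.write(f"# run: message_rate={rate} seed={seed}\n")
        for path in input_files:
            out.write(f"# input_file={path}\n")
        # "Reset": fresh statistics and a fresh generator for each run.
        rng = random.Random(seed)
        times = simulate(rate, 1000, rng)
        results[rate] = statistics.fmean(times)
        out.write(f"mean_response={results[rate]:.6f}\n")
    return results


log = io.StringIO()  # stands in for the output file
results = parametric_study([10.0, 20.0, 40.0],
                           ["common/messages.dat", "run01/topology.dat"], log)
```

Because each output begins with its own parameters and input file names, a run can be identified and repeated from the output file alone.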


RESULTS  ANALYSIS 


A.  Thorough  V&V  of  the  model  and  its  results  must  occur 
throughout  the  life  cycle  of  the  model. 

As  would  be  expected,  a  thorough  analysis  of  simulation 
results  occurred  after  the  initial  runs  to  V&V  the  system  model.
However,  because  the  model  was  undergoing  numerous,  if  minor,
modifications  and  was  being  used  for  rapid  prototype  runs,  not  all  the
modifications  were  thoroughly  verified  for  their  impact  on  the 
model.  One  set  of  results  showed  high  utilizations  on  certain 
processors  but  upon  further  research,  we  determined  that  there  were 
errors  in  some  message  iteration  periods  and  other  messages  were 
being  transferred  from  the  wrong  source.  In  another  instance, 
large  queues  ended  up  being  the  result  of  typographical  errors  in 
source  and  destination  names.  Once  we  animated  the  model,  we
realized  there  was  an  error  in  the  way  that  the  message  protocol
sequence  was  implemented.

Recommendations:

1.  The  structured  review  cycle  described  under  Model  Design 
can  be  applied.  Although  it  may  be  impractical  to  apply  to  minor 
model  modifications,  ensure  that  all  major  design  changes  and,  of 
course,  any  new  models  follow  the  review  cycle. 

2.  Maintain  engineering  notebooks  and  ensure  all  model 
designs,  design  changes  and  model  load  files  are  recorded  in  the 
notebooks. 

3.  Divide  the  modelling  team  into  three  groups:  scenario
developers,  model  developers  and  model  analysts.  The  scenario 
developers  design  all  scenarios  to  be  used  to  drive  the  models  and 
define  the  messages  and  processing  events  which  should  be  included 
in  each  operational  event  sequence.  The  model  developers  design 
and  develop  all  the  models.  The  model  analysts  analyze  all  the 
simulation  results  for  model  V&V,  and  for  performance  evaluation  of 
the  system  being  modelled.  In  this  way,  the  scenario  developers 
become  responsible  for  the  accuracy  of  the  messages  and  processing 
events  which  are  to  be  loaded  onto  the  model.  The  model  developers 
become  responsible  for  implementing  an  accurate  system  simulation. 
The  analysts  become  responsible  for  ensuring  the  accuracy  of  the 
results  and  determining  the  causes  of  any  variations  or  anomalies. 

B.  Steady  state  and  transients  may  cause  problems  in  results  if
not  well  understood. 

The  model  runs  had  to  be  of  sufficient  duration  to  ensure  that 
steady  state  was  reached.  Steady  state  problems  occurred  when  the 
system  model  was  loaded  such  that  all  periodic  messages  were 
transmitted  starting  at  time  zero.  This  caused  a  heavy  initial 
load  on  the  system,  so  we  had  to  wait  until  the  effects  of  this  had 
settled  out  and  the  messages  began  their  individual  periodic 
generation  cycles. 

The  analysis  of  particular  model  runs  revealed  large  message 
queues  at  some  nodes.  It  seemed  that  the  model  had  uncovered  a 
system  bottleneck.  Further  investigation  determined  that  the  cause 
of  the  large  queues  was  several  long  messages  being  transferred 
from  those  nodes.  Upon  additional  analysis,  it  was  discovered  that 
these  messages  did  not  travel  over  the  network  and  should  not  be 
included  in  system  network  evaluations.  A  thorough  investigation 
of  the  transient  large  queues  kept  us  from  erroneously
reporting  the  existence  of  a  system  bottleneck. 

Recommendations:

1.  Use  of  periodic  snapshots  during  the  model  runs  supports 
assessment  of  when  steady  state  has  been  reached. 

2.  The  modelling  language's  reset  option  should  be  used  to 
reset  the  statistics  during  the  simulations  after  the  initial  heavy 
load  of  messages  has  occurred. 

3.  To  alleviate  some  of  the  initial  heavy  loading  problem,
develop  start  time  rules  for  when  to  start  periodic  messages. 

4.  Dividing  the  modelling  team  into  subgroups,  as  described
under  Results  Analysis,  makes  it  easier  to  trace  problems
observed  in  the  results.  The  separation  of  the  team  into  scenario 
developers,  model  developers  and  results  analysts  ensures  distinct 
allocation  of  responsibilities.  This  supports  the  "retracing  of 
steps",  from  statistical  output  to  model  input  or  model  design, 
needed  to  understand  transient  results. 
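
The first three recommendations above can be sketched together: periodic snapshots, discarding statistics gathered during the initial transient, and a start time rule that staggers the periodic messages. The particular rule shown (spreading first transmissions across each message's period), the function names, and the sample queue lengths are assumptions made for illustration.

```python
import statistics


def staggered_starts(periods):
    """Spread first transmissions across each period instead of time zero."""
    n = len(periods)
    return [period * i / n for i, period in enumerate(periods)]


def snapshots(samples, interval):
    """Mean of the values observed in each consecutive time window."""
    windows = {}
    for t, value in samples:
        windows.setdefault(int(t // interval), []).append(value)
    return [(k * interval, statistics.fmean(v))
            for k, v in sorted(windows.items())]


def after_warmup(samples, warmup):
    """Discard statistics gathered before `warmup`, like a statistics reset."""
    return [value for t, value in samples if t >= warmup]


# Made-up (time, queue length) samples: a heavy initial load that settles.
samples = [(0.0, 40), (1.0, 25), (2.0, 9), (3.0, 2),
           (4.0, 3), (5.0, 2), (6.0, 1)]

window_means = snapshots(samples, interval=2.0)
steady_mean = statistics.fmean(after_warmup(samples, warmup=3.0))
starts = staggered_starts([1.0, 2.0, 4.0, 4.0])
```

Comparing successive window means gives a simple indication of when the initial transient has passed and the statistics can be reset.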

C.  Must  use  caution  when  comparing  results  of  different  models. 

When  the  project  began,  there  were  three  separate  modelling 
efforts  being  conducted  in  parallel:  each  competing  contractor 
developed  models  of  their  own  proposed  architecture;  NUWC  also 
developed  a  separate  model  for  each  competing  architecture.  This 
meant  four  different  models.  The  modelling  languages  used  for  each 
were  not  necessarily  the  same.  The  problem  became:  how  do  we 
compare  the  models'  results  and  make  sure  that  the  comparisons  are 
valid  and  fair? 

An  operational  scenario  consistent  with  doctrine  and  the 
intended  use  of  the  system  was  developed.  A  set  of  stress  scripts 
which  depicted  system  load  at  various  times  during  a  mission  were 
extracted  from  the  scenario.  These  scripts  provided  a  common 
baseline  for  loading  and  running  all  of  the  models.  This  enabled 
the  modelling  team  to  evaluate  results  from  different  models  of  the 
same  architectures  using  the  same  set  of  metrics. 

After  source  selection,  two  parallel  architecture  modelling 
efforts  were  conducted  by  NUWC  and  the  development  contractor. 
Results  were  compared.  In  some  instances,  the  results  conflicted 
but  it  turned  out  to  be  caused  by  either  the  use  of  different 
assumptions  or  a  difference  in  model  implementation.  A  conflict  in 
resource  utilizations  was  found  to  be  caused  by  the  addition  of 
messages  which  did  not  in  fact  travel  over  the  network,  and  the 
implementation  of  multicasting  communications  within  the  NUWC 
model.  Higher  transfer  delay  times  were  due  to  an  erroneous 
implementation  of  a  data  packet  transfer  through  a  forwarding  node. 
Once  changes  were  made,  the  results  were  consistent  between  the 
models.

Recommendations:


1.  The  initial  choice  would  seem  to  be  to  use  the  same 
modelling  language.  However,  the  goal  of  developing  independent 
models  of  the  architectures  is  to  verify  that  the  contractor  is  in 
fact  designing  the  system  as  proposed  and  required.  This  can  be 
verified  via  verification  and  validation  of  the  contractor's 
models.  If  the  assumptions,  data,  designs,  and  implementations  are 
accurate,  then  results  from  separate  models,  even  in  different 
languages,  should  be  consistent.  A  danger  of  using  the  same 
language  is  that  one  of  the  modelling  teams  can  easily  end  up 
duplicating  the  other  team's  models  without  even  being  aware  this
is  happening.  Hence  an  independent  evaluation  of  the  system  does 
not  occur. 

2.  Using  a  common  set  of  realistic  operational  scenarios  to 
drive  the  models  provides  a  common  baseline  for  comparison  between 
the  government's  predictions  and  the  prime  development 
contractor's.  This  forms  the  basis  for  a  common  understanding  of 
what  is  expected,  a  consistent  set  of  results,  and  a  forum  in  which 
to  discuss  resulting  system  issues,  rather  than  dwelling  on  the 
"ones  and  zeros". 
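
When two independent models are driven by the same scenario, their results can be compared metric by metric and any large disagreement flagged for investigation of assumptions or implementation. The sketch below is one possible way to do this; the metric names, the values, and the 10 percent tolerance are hypothetical.

```python
def compare_results(model_a, model_b, rel_tol=0.10):
    """Return metrics whose values differ by more than `rel_tol`."""
    flagged = {}
    for metric in sorted(set(model_a) & set(model_b)):
        a, b = model_a[metric], model_b[metric]
        scale = max(abs(a), abs(b))
        if scale > 0 and abs(a - b) / scale > rel_tol:
            flagged[metric] = (a, b)
    return flagged


# Made-up results from two models run against the same stress script.
government = {"bus_utilization": 0.42, "path_delay_ms": 18.0}
contractor = {"bus_utilization": 0.31, "path_delay_ms": 17.5}

flagged = compare_results(government, contractor)
```

A flagged metric is a prompt to retrace assumptions and implementation in both models, not evidence that either model is the one in error.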


GENERAL  LESSONS 

There  are  additional  lessons,  which  did  not  result  from 
difficulties  encountered,  but  became  emphasized  over  the  course  of 
the  program.  The  NUWC/MYSTECH  team  consisted  of  people  with 
varying  levels  of  experience  and  education.  There  were  project 
managers,  systems  and  software  engineers,  operational  analysts,  and 
junior  programmers.  Years  of  experience  ranged  from  over  twenty 
years  to  new  hires  fresh  out  of  college.  There  were  experienced 
modelers,  and  some  with  no  experience  at  all.  Despite  this 
diversity,  the  effort  of  designing  and  developing  the  models  taught 
two  very  valuable  lessons: 

A.  Even  if  the  models  had  not  been  built,  the  amount  of 
information  gathered  on  the  new  system  design,  and  the  insight 
gained  from  dissecting  the  architecture  into  its  various  components 
for  close  scrutiny  provided  valuable  support  to  NUWC's  Technical 
Direction  Agent  responsibilities.  Model  development  is  a  good 
educational  tool  to  support  learning  about  the  intricacies  of  a
system  quickly. 

B.  The  assignment  of  different  tasks  to  different  people 
focussed  their  efforts.  The  definition  of  model  goals  provided  a 
direction  to  follow  while  uncovering  the  necessary  data.  The 
interfaces  between  the  detailed  models  and  the  system  model,  and 
the  structured  review  cycle,  ensured  the  flow  of  information  among 
the  team  members  so  that  everyone  shared  the  same  general  level  of 
system  understanding.  The  value  of  the  TEAM  was  emphasized.  The 
sharing  of  information  across  functional  and  organizational 
boundaries  allowed  NUWC  to  examine  system  interfaces  and  identify 
critical  areas  of  risk  which  may  not  otherwise  have  been 
discovered.  Important  dialogues  were  initiated  which  will  continue 
for  the  next  generation  of  systems  development. 


PRODUCTS  AND  METHODOLOGIES 


Examination  of  the  lessons  learned  over  the  past  several  years 
indicates  several  voids  in  modelling  and  analysis  activities  which 
we  are  now  beginning  to  fill.  Several  products  and  methodologies 
are  being  developed  which  capitalize  on  the  lessons  [19].  This
section  describes  those  products  and  methodologies. 

•  Develop  an  information  base  to  store  all  the  data  as  it 
is  acquired.  Organize  the  data  so  that  it  can  easily  be 
accessed  for  other  projects  and  models. 

•  Add  to  the  information  base  all  system  performance 
requirements  and  model  data  to  facilitate  configuration 
management.

•  Develop  a  library  of  reusable  simulation  objects  that  can 
be  expanded  as  new  models  are  developed.  Include  within 
the  library  analytical  and  simulation  tools  and 
languages,  and  in  particular,  include  an  object-oriented 
programming  language. 

•  Add  to  the  library  any  simulation  and  analytical  models 
which  can  be  reused  on  different  modelling  tasks. 

•  Develop  a  Concept  Assessment  Tool  to  aid  the  modeler  in
organizing  and  learning  top  level  system  requirements. 

•  Design  the  Concept  Assessment  Tool  such  that  it  becomes 
an  automated  method  for  composing  a  set  of  measures  of 
effectiveness  from  system  performance  requirements. 
Tying  evaluation  criteria  to  specific  system  requirements 
allows  validation  of  model  goals. 

•  Develop  a  run  time  control  facility  to  assist  in  setting 
up  model  runs  and  submitting  and  controlling  the  runs. 
The  facility  would  also  support  the  creation  of  a  run 
library  containing  all  the  elements  required  to  replicate 
a  run. 

•  Utilize  the  run  time  control  facility  to  assist  the 
modeler  in  preparing  the  runs,  and  to  provide 
configuration  management  of  the  model  runs. 

•  Develop  an  analysis/V&V  facility  to  assist  the  analyst  in 
evaluating  model  results.  Utilize  animation  capabilities 
to  assist  in  model  V&V. 


SUMMARY 

Difficulties  encountered  over  the  past  seven  years  of 
submarine  combat  system  architecture  modelling,  and  the  lessons 
learned  along  with  their  recommended  solutions,  have  been  explored. 
Several  products  and  methodologies  for  avoiding  the  same 
difficulties  (or  at  least  being  better  prepared  to  meet  them)  in 
the  future  have  been  discussed.  Future  plans  for  application  of 
these  products  and  methods  are  being  developed.  These  plans 
include:


•  Development  of  an  integrated  simulation  environment  which
would  include  facilities  for  scenario  generation,  model 
development,  and  results  analysis.  The  simulation 
environment  should  provide  all  the  tools  needed  for 
accurate  and  thorough  assessment  of  an  architecture 
through  simulation. 

•  Development  of  a  graphical  interface  to  the  simulation
environment  which  would  include  a  model  user  facility  to 
allow  program  managers  and  system  designers  to  perform 
analysis  and  evaluation  without  needing  to  know  the 
details  of  the  models  themselves. 